Abstract

Boosting combines weak classifiers to form highly accurate predictors. Although the case of 
binary classification is well understood, in the multiclass setting, the "correct" requirements 
on the weak classifier, or the notion of the most efficient boosting algorithms are missing. 
In this paper, we create a broad and general framework, within which we make precise 
and identify the optimal requirements on the weak-classifier, as well as design the most 
effective, in a certain sense, boosting algorithms that assume such requirements. 
Keywords: Multiclass, boosting, weak learning condition, drifting games 

1. Introduction 

Boosting (Schapire and Freund, 2012) refers to a general technique of combining rules of 
thumb, or weak classifiers, to form highly accurate combined classifiers. Minimal demands 
are placed on the weak classifiers, so that a variety of learning algorithms, also called 
weak-learners, can be employed to discover these simple rules, making the algorithm widely 
applicable. The theory of boosting is well-developed for the case of binary classification. 
In particular, the exact requirements on the weak classifiers in this setting are known: any 
algorithm that predicts better than random on any distribution over the training set is said 
to satisfy the weak learning assumption. Further, boosting algorithms that minimize loss 
as efficiently as possible have been designed. Specifically, it is known that the Boost-by-majority
(Freund, 1995) algorithm is optimal in a certain sense, and that AdaBoost (Freund
and Schapire, 1997) is a practical approximation. 

Such an understanding would be desirable in the multiclass setting as well, since many 
natural classification problems involve more than two labels, e.g. recognizing a digit from 
its image, natural language processing tasks such as part-of-speech tagging, and object 
recognition in vision. However, for such multiclass problems, a complete theoretical
understanding of boosting is lacking. In particular, we do not know the "correct" way to
define the requirements on the weak classifiers, nor has the notion of optimal boosting been 
explored in the multiclass setting. 

Straightforward extensions of the binary weak-learning condition to multiclass do not 
work. Requiring less error than random guessing on every distribution, as in the binary case, 
turns out to be too weak for boosting to be possible when there are more than two labels. 
On the other hand, requiring more than 50% accuracy even when the number of labels is 
much larger than two is too stringent, and simple weak classifiers like decision stumps fail 
to meet this criterion, even though they often can be combined to produce highly accurate
classifiers (Freund and Schapire, 1996a). The most common approaches so far have relied 
on reductions to binary classification (Allwein et al., 2000), but it is hardly clear that the 
weak-learning conditions implicitly assumed by such reductions are the most appropriate. 

The purpose of a weak-learning condition is to clarify the goal of the weak-learner, 
thus aiding in its design, while providing a specific minimal guarantee on performance 
that can be exploited by a boosting algorithm. These considerations may significantly 
impact learning and generalization because knowing the correct weak-learning conditions 
might allow the use of simpler weak classifiers, which in turn can help prevent overfitting. 
Furthermore, boosting algorithms that more efficiently and effectively minimize training 
error may prevent underfitting, which can also be important. 

In this paper, we create a broad and general framework for studying multiclass boosting 
that formalizes the interaction between the boosting algorithm and the weak-learner. Unlike 
much, but not all, of the previous work on multiclass boosting, we focus specifically on the 
most natural, and perhaps weakest, case in which the weak classifiers are genuine classifiers 
in the sense of predicting a single multiclass label for each instance. Our new framework 
allows us to express a range of weak-learning conditions, both new ones and most of the 
ones that had previously been assumed (often only implicitly). Within this formalism, we 
can also now finally make precise what is meant by correct weak-learning conditions that 
are neither too weak nor too strong. 

We focus particularly on a family of novel weak-learning conditions that have an especially
appealing form: like the binary conditions, they require performance that is only
slightly better than random guessing, though with respect to performance measures that 
are more general than ordinary classification error. We introduce a whole family of such 
conditions since there are many ways of randomly guessing on more than two labels, a key 
difference between the binary and multiclass settings. Although these conditions impose 
seemingly mild demands on the weak-learner, we show that each one of them is powerful 
enough to guarantee boostability, meaning that some combination of the weak classifiers has 
high accuracy. And while no individual member of the family is necessary for boostability, 
we also show that the entire family taken together is necessary in the sense that for every 
boostable learning problem, there exists one member of the family that is satisfied. Thus, 
we have identified a family of conditions which, as a whole, is necessary and sufficient for 
multiclass boosting. Moreover, we can combine the entire family into a single weak-learning 
condition that is necessary and sufficient by taking a kind of union, or logical OR, of all the 
members. This combined condition can also be expressed in our framework.

With this understanding, we are able to characterize previously studied weak-learning 
conditions. In particular, the condition implicitly used by AdaBoost.MH (Schapire and 
Singer, 1999), which is based on a one-against-all reduction to binary, turns out to be strictly
stronger than necessary for boostability. This also applies to AdaBoost.M1 (Freund and
Schapire, 1996a), the most direct generalization of AdaBoost to multiclass, whose conditions 
can be shown to be equivalent to those of AdaBoost.MH in our setting. On the other hand, 
the condition implicit to the SAMME algorithm by Zhu et al. (2009) is too weak in the 
sense that even when the condition is satisfied, no boosting algorithm can guarantee to 
drive down the training error. Finally, the condition implicit to AdaBoost.MR (Schapire
and Singer, 1999; Freund and Schapire, 1996a) (also called AdaBoost.M2) turns out to be
exactly necessary and sufficient for boostability. 

Employing proper weak-learning conditions is important, but we also need boosting 
algorithms that can exploit these conditions to effectively drive down error. For a given 
weak-learning condition, the boosting algorithm that drives down training error most efficiently
in our framework can be understood as the optimal strategy for playing a certain
two-player game. These games are non-trivial to analyze. However, using the powerful machinery
of drifting games (Freund and Opper, 2002; Schapire, 2001), we are able to compute
the optimal strategy for the games arising out of each weak-learning condition in the family 
described above. Compared to earlier work, our optimality results hold more generally 
and also achieve tighter bounds. These optimal strategies have a natural interpretation in 
terms of random walks, a phenomenon that has been observed in other settings (Abernethy 
et al., 2008; Freund, 1995). 

We also analyze the optimal boosting strategy when using the minimal weak learning 
condition, and this poses additional challenges. Firstly, the minimal weak learning condition 
has multiple natural formulations (e.g., as the union of all the conditions in the family
described above, or the formulation used in AdaBoost.MR), and each formulation leads
to a different game specification. A priori, it is not clear which game would lead to the
best strategy. We resolve this dilemma by proving that the optimal strategies arising out 
of different formulations of the same weak learning condition lead to algorithms that are 
essentially equally good, and therefore we are free to choose whichever formulation leads 
to an easier analysis without fear of suffering in performance. We choose the union of 
conditions formulation, since it leads to strategies that share the same interpretation in 
terms of random walks as before. However, even with this choice, the resulting games 
are hard to analyze, and although we can explicitly compute the optimum strategies in 
general, the computational complexity is usually exponential. Nevertheless, we identify key 
situations under which efficient computation is possible. 

The game-theoretic strategies are non-adaptive in that they presume prior knowledge
about the edge, that is, how much better than random the weak classifiers are. Algorithms
that are adaptive, such as AdaBoost, are much more practical because they do not require 
such prior information. We show therefore how to derive an adaptive boosting algorithm 
by modifying the game-theoretic strategy based on the minimal condition. This algorithm 
enjoys a number of theoretical guarantees. Unlike some of the non-adaptive strategies, it 
is efficiently computable, and since it is based on the minimal weak learning condition, it 
makes minimal assumptions. In fact, whenever presented with a boostable learning problem, 
this algorithm can approach zero training error at an exponential rate. More importantly, 
the algorithm is effective even beyond the boostability framework. In particular, we show 
empirical consistency, i.e., the algorithm always converges to the minimum of a certain 
exponential loss over the training data, whether or not the dataset is boostable. Furthermore,
using the results in (Mukherjee et al., 2011) we can show that this convergence occurs
rapidly. 

Our focus in this paper is only on minimizing training error, which, for the algorithms 
we derive, provably decreases exponentially fast with the number of rounds of boosting 
under boostability assumptions. Such results can be used in turn to derive bounds on the 
generalization error using standard techniques that have been applied to other boosting 
algorithms (Schapire et al., 1998; Freund and Schapire, 1997; Koltchinskii and Panchenko,
2002). Consistency in the multiclass classification setting has been studied by Tewari and
Bartlett (2007) and has been shown to be trickier than binary classification consistency.
Nonetheless, by following the approach in (Bartlett and Traskin, 2007) for showing consistency
in the binary setting, we are able to extend the empirical consistency guarantees
to general consistency guarantees in the multiclass setting: we show that under certain 
conditions and with sufficient data, our adaptive algorithm approaches the Bayes-optimum 
error on the test dataset. 

We present experiments aimed at testing the efficacy of the adaptive algorithm when 
working with a very weak weak-learner to check that the conditions we have identified are 
indeed weaker than others that had previously been used. We find that our new adaptive 
strategy achieves low test error compared to other multiclass boosting algorithms which 
usually heavily underfit. This validates the potential practical benefit of a better theoretical 
understanding of multiclass boosting. 

Previous work. The first boosting algorithms were given by Schapire (1990) and Freund
(1995), followed by their AdaBoost algorithm (Freund and Schapire, 1997). Multiclass
boosting techniques include AdaBoost.M1 and AdaBoost.M2 (Freund and Schapire, 1997),
as well as AdaBoost.MH and AdaBoost.MR (Schapire and Singer, 1999). Other approaches
include the work by Eibl and Pfeiffer (2005) and Zhu et al. (2009). There are also more general
approaches that can be applied to boosting including (Allwein et al., 2000; Beygelzimer
et al., 2009; Dietterich and Bakiri, 1995; Hastie and Tibshirani, 1998). Two game-theoretic
perspectives have been applied to boosting. The first one (Freund and Schapire, 1996b;
Rätsch and Warmuth, 2005) views the weak-learning condition as a minimax game, while
drifting games (Schapire, 2001; Freund, 1995) were designed to analyze the most efficient
boosting algorithms. These games have been further analyzed in the multiclass and continuous
time setting in (Freund and Opper, 2002).

2. Framework 

We introduce some notation. Unless otherwise stated, matrices will be denoted by bold
capital letters like $\mathbf{M}$, and vectors by bold small letters like $\mathbf{v}$. Entries of a matrix and
vector will be denoted as $\mathbf{M}(i,j)$ or $\mathbf{v}(i)$, while $\mathbf{M}(i)$ will denote the $i$th row of a matrix.
The inner product of two vectors $\mathbf{u}, \mathbf{v}$ is denoted by $\langle \mathbf{u}, \mathbf{v} \rangle$. The Frobenius inner product
$\mathrm{Tr}(\mathbf{M}\mathbf{M}'^{\top})$ of two matrices $\mathbf{M}, \mathbf{M}'$ will be denoted by $\mathbf{M} \bullet \mathbf{M}'$, where $\mathbf{M}'^{\top}$ is the
transpose of $\mathbf{M}'$. The indicator function is denoted by $1[\cdot]$. The set of all distributions
over the set $\{1, \ldots, k\}$ will be denoted by $\Delta\{1, \ldots, k\}$, and in general, the set of all
distributions over any set $S$ will be denoted by $\Delta(S)$.

In multiclass classification, we want to predict the labels of examples lying in some set 
$X$. We are provided a training set of labeled examples $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, where each
example $x_i \in X$ has a label $y_i$ in the set $\{1, \ldots, k\}$.

Boosting combines several mildly powerful predictors, called weak classifiers, to form
a highly accurate combined classifier, and has been previously applied for multiclass
classification. In this paper, we only allow weak classifiers that predict a single class for each
example. This is appealing, since the combined classifier has the same form, although it
differs from what has been used in much previous work. Later we will expand our framework
to include multilabel weak classifiers, which may predict multiple labels per example.

We adopt a game-theoretic view of boosting. A game is played between two players, 
Booster and Weak-Learner, for a fixed number of rounds T. With binary labels, Booster 
outputs a distribution in each round, and Weak-Learner returns a weak classifier achieving 
more than 50% accuracy on that distribution. The multiclass game is an extension of the 
binary game. In particular, in each round t: 

• Booster creates a cost-matrix $\mathbf{C}_t \in \mathbb{R}^{m \times k}$, specifying to Weak-Learner that the cost
of classifying example $x_i$ as $l$ is $\mathbf{C}_t(i, l)$. The cost-matrix may not be arbitrary, but
should conform to certain restrictions as discussed below.

• Weak-Learner returns some weak classifier $h_t : X \to \{1, \ldots, k\}$ from a fixed space
$h_t \in \mathcal{H}$ so that the cost incurred,

\[ \mathbf{C}_t \bullet \mathbf{1}_{h_t} = \sum_{i=1}^{m} \mathbf{C}_t(i, h_t(x_i)), \]

is "small enough", according to some conditions discussed below. Here by $\mathbf{1}_h$ we mean
the $m \times k$ matrix whose $(i, j)$-th entry is $1[h(i) = j]$.

• Booster computes a weight $\alpha_t$ for the current weak classifier based on how much cost
was incurred in this round.

At the end, Booster predicts according to the weighted plurality vote of the classifiers
returned in each round:

\[ H(x) = \operatorname*{argmax}_{l \in \{1, \ldots, k\}} f_T(x, l), \quad \text{where } f_T(x, l) = \sum_{t=1}^{T} 1[h_t(x) = l]\, \alpha_t. \tag{1} \]
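The prediction rule (1) can be sketched in a few lines of Python. This is only an illustration: the function name `plurality_vote`, the representation of weak classifiers as callables returning a label in 1..k, and the tie-breaking rule (smallest label wins) are assumptions made here, since the text does not specify how ties are resolved.

```python
def plurality_vote(weak_classifiers, alphas, x, k):
    """Return argmax over l in 1..k of f_T(x, l) = sum_t alpha_t * 1[h_t(x) = l].

    Assumption: ties are broken toward the smallest label.
    """
    scores = [0.0] * (k + 1)  # index 0 is unused so labels map directly to indices
    for h, alpha in zip(weak_classifiers, alphas):
        scores[h(x)] += alpha  # each classifier votes for exactly one label
    # max() returns the first maximal element, i.e. the smallest winning label
    return max(range(1, k + 1), key=lambda l: scores[l])
```

For instance, with three weak classifiers voting 1, 2, 2 and weights 0.5, 0.3, 0.3, label 2 collects weight 0.6 and wins over label 1.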

By carefully choosing the cost matrices in each round, Booster aims to minimize the training
error of the final classifier $H$, even when Weak-Learner is adversarial. The restrictions
on cost-matrices created by Booster, and the maximum cost Weak-Learner can suffer in
each round, together define the weak-learning condition being used. For binary labels, the
traditional weak-learning condition states: for any non-negative weights $w(1), \ldots, w(m)$ on
the training set, the error of the weak classifier returned is at most $(1/2 - \gamma/2) \sum_i w(i)$. Here
$\gamma$ parametrizes the condition. There are many ways to translate this condition into our
language. The one with fewest restrictions on the cost-matrices requires that labeling correctly
be less costly than labeling incorrectly:

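\[ \forall i : \mathbf{C}(i, y_i) \le \mathbf{C}(i, \bar{y}_i) \quad (\text{here } \bar{y}_i \ne y_i \text{ is the other binary label}), \]

while the restriction on the returned weak classifier $h$ requires less cost than predicting
randomly:

\[ \sum_{i=1}^{m} \mathbf{C}(i, h(x_i)) \le \sum_{i=1}^{m} \left\{ \left( \frac{1}{2} - \frac{\gamma}{2} \right) \mathbf{C}(i, \bar{y}_i) + \left( \frac{1}{2} + \frac{\gamma}{2} \right) \mathbf{C}(i, y_i) \right\}. \]

By the correspondence $w(i) = \mathbf{C}(i, \bar{y}_i) - \mathbf{C}(i, y_i)$, we may verify the two conditions are the
same.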
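The verification mentioned above can be spelled out as a short derivation. It is a sketch that uses only the symbols already introduced ($w$, $\mathbf{C}$, $\gamma$, the true label $y_i$ and the other binary label $\bar{y}_i$), and it relies on the cost restriction $\mathbf{C}(i, y_i) \le \mathbf{C}(i, \bar{y}_i)$ so that every $w(i)$ is non-negative.

```latex
\begin{align*}
\mathbf{C}(i, h(x_i)) &= \mathbf{C}(i, y_i) + w(i)\, 1[h(x_i) \ne y_i], \\
\left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right) \mathbf{C}(i, \bar{y}_i)
  + \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right) \mathbf{C}(i, y_i)
  &= \mathbf{C}(i, y_i) + \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right) w(i), \\
\text{so}\quad \sum_{i=1}^{m} \mathbf{C}(i, h(x_i)) \le \sum_{i=1}^{m} \Bigl\{ \cdots \Bigr\}
  &\iff \sum_{i=1}^{m} w(i)\, 1[h(x_i) \ne y_i] \le \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right) \sum_{i=1}^{m} w(i).
\end{align*}
```

Summing the first two identities over $i$ and cancelling the common term $\sum_i \mathbf{C}(i, y_i)$ gives the last line, which is the weighted-error form of the binary condition.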

We will rewrite this condition after making some simplifying assumptions. Henceforth,
without loss of generality, we assume that the true label is always 1. Let $\mathcal{C}^{\mathrm{bin}} \subseteq \mathbb{R}^{m \times 2}$
consist of matrices $\mathbf{C}$ which satisfy $\mathbf{C}(i, 1) \le \mathbf{C}(i, 2)$. Further, let $\mathbf{U}^{\mathrm{bin}}_{\gamma} \in \mathbb{R}^{m \times 2}$ be the
matrix each of whose rows is $(1/2 + \gamma/2, 1/2 - \gamma/2)$. Then, Weak-Learner searching space $\mathcal{H}$
satisfies the binary weak-learning condition if: $\forall \mathbf{C} \in \mathcal{C}^{\mathrm{bin}}, \exists h \in \mathcal{H} : \mathbf{C} \bullet (\mathbf{1}_h - \mathbf{U}^{\mathrm{bin}}_{\gamma}) \le 0$.
There are two main benefits to this reformulation. With linear homogeneous constraints,
the mathematics is simplified, as will be apparent later. More importantly, by varying the
restrictions $\mathcal{C}^{\mathrm{bin}}$ on the cost vectors and the matrix $\mathbf{U}^{\mathrm{bin}}_{\gamma}$, we can generate a vast variety of
weak-learning conditions for the multiclass setting $k > 2$ as we now show.

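Let $\mathcal{C} \subseteq \mathbb{R}^{m \times k}$ and let $\mathbf{B} \in \mathbb{R}^{m \times k}$ be a matrix which we call the baseline. We say a weak
classifier space $\mathcal{H}$ satisfies the condition $(\mathcal{C}, \mathbf{B})$ if

\[ \forall \mathbf{C} \in \mathcal{C}, \exists h \in \mathcal{H} : \mathbf{C} \bullet (\mathbf{1}_h - \mathbf{B}) \le 0, \quad \text{i.e.,} \quad \sum_{i=1}^{m} \mathbf{C}(i, h(i)) \le \sum_{i=1}^{m} \langle \mathbf{C}(i), \mathbf{B}(i) \rangle. \tag{2} \]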

In (2), the variable matrix C specifies how costly each misclassification is, while the baseline 
B specifies a weight for each misclassification. The condition therefore states that a weak 
classifier should not exceed the average cost when weighted according to baseline B. This 
large class of weak-learning conditions captures many previously used conditions, such as 
the ones used by AdaBoost.M1 (Freund and Schapire, 1996a), AdaBoost.MH (Schapire and
Singer, 1999) and AdaBoost.MR (Freund and Schapire, 1996a; Schapire and Singer, 1999) 
(see below), as well as novel conditions introduced in the next section. 
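To make the inequality in (2) concrete, the following Python sketch tests it for a single cost matrix against a finite list of weak classifiers. The names `frobenius`, `indicator_matrix` and `satisfies_for_cost` are hypothetical, classifiers are represented simply by their prediction vectors on the training set, and the small tolerance is an assumption to absorb floating-point error. Note that the full condition quantifies over every cost matrix in the (typically infinite) set of allowed matrices, which this check does not attempt.

```python
def frobenius(A, B):
    """Frobenius inner product A . B of two equally sized matrices (lists of rows)."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def indicator_matrix(preds, k):
    """The m x k matrix 1_h whose (i, j) entry is 1 when preds[i] == j (labels 1..k)."""
    return [[1.0 if p == j else 0.0 for j in range(1, k + 1)] for p in preds]

def satisfies_for_cost(C, B, classifier_preds, tol=1e-12):
    """True if some classifier h in the finite space has C . (1_h - B) <= 0.

    This checks inequality (2) for ONE cost matrix C only.
    """
    k = len(B[0])
    baseline_cost = frobenius(C, B)
    return any(
        frobenius(C, indicator_matrix(preds, k)) <= baseline_cost + tol
        for preds in classifier_preds
    )
```

As a usage example, take two examples with true label 1, three labels, a baseline whose rows are (0.4, 0.3, 0.3), and unit cost on every incorrect label: the baseline cost is 1.2, so a classifier that gets one example right (cost 1) passes, while one that gets both wrong (cost 2) does not.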

By studying this vast class of weak-learning conditions, we hope to find the one that 
will serve the main purpose of the boosting game: finding a convex combination of weak 
classifiers that has zero training error. For this to be possible, at the minimum the 
weak classifiers should be sufficiently rich for such a perfect combination to exist. Formally,
a collection $\mathcal{H}$ of weak classifiers is boostable if it is eligible for boosting in the
sense that there exists a distribution $\lambda$ on this space that linearly separates the data:

\[ \forall i : \operatorname*{argmax}_{l \in \{1, \ldots, k\}} \sum_{h \in \mathcal{H}} \lambda(h)\, 1[h(x_i) = l] = y_i. \]

The weak-learning condition plays two
roles. It rejects spaces that are not boostable, and provides an algorithmic means of searching
for the right combination. Ideally, the second factor will not cause the weak-learning
condition to impose additional restrictions on the weak classifiers; in that case, the weak-learning
condition is merely a reformulation of being boostable that is more appropriate
for deriving an algorithm. In general, it could be too strong, i.e., certain boostable spaces
will fail to satisfy the conditions. Or it could be too weak, i.e., non-boostable spaces might
satisfy such a condition. Booster strategies relying on either of these conditions will fail to
drive down error, the former due to underfitting, and the latter due to overfitting. Later
we will describe conditions captured by our framework that avoid being too weak or too 
strong. But before that, we show in the next section how our flexible framework captures 
weak learning conditions that have appeared previously in the literature. 
3. Old conditions

In this section, we rewrite, in the language of our framework, the weak learning conditions
explicitly or implicitly employed in the multiclass boosting algorithms SAMME (Zhu
et al., 2009), AdaBoost.M1 (Freund and Schapire, 1996a), and AdaBoost.MH and AdaBoost.MR
(Schapire and Singer, 1999). This will be useful later on for comparing the
strengths and weaknesses of the various conditions. We will end this section with a curious
equivalence between the conditions of AdaBoost.MH and AdaBoost.M1.

Recall that we have assumed the correct label is 1 for every example. Nevertheless, we 
continue to use yi to denote the correct label in this section. 



3.1 Old conditions in the new framework 

Here we restate, in the language of our new framework, the weak learning conditions of four 
algorithms that have earlier appeared in the literature. 

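SAMME. The SAMME algorithm (Zhu et al., 2009) requires less error than random
guessing on any distribution on the examples. Formally, a space $\mathcal{H}$ satisfies the condition
if there is a $\gamma' > 0$ such that,

\[ \forall d(1), \ldots, d(m) \ge 0, \exists h \in \mathcal{H} : \sum_{i=1}^{m} d(i)\, 1[h(x_i) \ne y_i] \le (1 - 1/k - \gamma') \sum_{i=1}^{m} d(i). \tag{3} \]

Define a cost matrix $\mathbf{C}$ whose entries are given by

\[ \mathbf{C}(i, j) = \begin{cases} d(i) & \text{if } j \ne y_i, \\ 0 & \text{if } j = y_i. \end{cases} \]

Then the left hand side of (3) can be written as

\[ \sum_{i=1}^{m} \mathbf{C}(i, h(x_i)) = \mathbf{C} \bullet \mathbf{1}_h. \]

Next let $\gamma = (1 - 1/k)\gamma'$ and define baseline $\mathbf{U}_{\gamma}$ to be the multiclass extension of $\mathbf{U}^{\mathrm{bin}}$,

\[ \mathbf{U}_{\gamma}(i, l) = \begin{cases} \frac{1 - \gamma}{k} + \gamma & \text{if } l = y_i, \\ \frac{1 - \gamma}{k} & \text{if } l \ne y_i. \end{cases} \]

Then the right hand side of (3) can be written as

\[ \sum_{i=1}^{m} \sum_{l \ne y_i} \mathbf{C}(i, l)\, \mathbf{U}_{\gamma}(i, l) = \mathbf{C} \bullet \mathbf{U}_{\gamma}, \]

since $\mathbf{C}(i, y_i) = 0$ for every example $i$. Define $\mathcal{C}^{\mathrm{SAM}}$ to be the following collection of cost
matrices:

\[ \mathcal{C}^{\mathrm{SAM}} = \left\{ \mathbf{C} : \mathbf{C}(i, l) = \begin{cases} 0 & \text{if } l = y_i, \\ t_i & \text{if } l \ne y_i, \end{cases} \text{ for non-negative } t_1, \ldots, t_m \right\}. \]

Using the last two equations, (3) is equivalent to

\[ \forall \mathbf{C} \in \mathcal{C}^{\mathrm{SAM}}, \exists h \in \mathcal{H} : \mathbf{C} \bullet (\mathbf{1}_h - \mathbf{U}_{\gamma}) \le 0. \]

Therefore, the weak-learning condition of SAMME is given by $(\mathcal{C}^{\mathrm{SAM}}, \mathbf{U}_{\gamma})$.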

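AdaBoost.M1. AdaBoost.M1 (Freund and Schapire, 1997) measures the performance of
weak classifiers using ordinary error. It requires $1/2 + \gamma/2$ accuracy with respect to any
non-negative weights $d(1), \ldots, d(m)$ on the training set:

\[ \sum_{i=1}^{m} d(i)\, 1[h(x_i) \ne y_i] \le \left( \frac{1}{2} - \frac{\gamma}{2} \right) \sum_{i=1}^{m} d(i), \tag{4} \]

\[ \text{i.e.} \quad \sum_{i=1}^{m} d(i)\, [\![ h(x_i) \ne y_i ]\!] \le -\gamma \sum_{i=1}^{m} d(i), \tag{5} \]

where $[\![ \cdot ]\!]$ is the $\pm 1$ indicator function, taking value $+1$ when its argument is true, and $-1$
when false. Using the transformation

\[ \mathbf{C}(i, l) = [\![ l \ne y_i ]\!]\, d(i) \]

we may rewrite (5) as

\[ \forall \mathbf{C} \in \mathbb{R}^{m \times k} \text{ satisfying } 0 \le -\mathbf{C}(i, y_i) = \mathbf{C}(i, l) \text{ for } l \ne y_i, \tag{6} \]

\[ \exists h \in \mathcal{H} : \sum_{i=1}^{m} \mathbf{C}(i, h(x_i)) \le \gamma \sum_{i=1}^{m} \mathbf{C}(i, y_i), \]

\[ \text{i.e.} \quad \forall \mathbf{C} \in \mathcal{C}^{\mathrm{M1}}, \exists h \in \mathcal{H} : \mathbf{C} \bullet (\mathbf{1}_h - \mathbf{B}^{\mathrm{M1}}_{\gamma}) \le 0, \tag{7} \]

where $\mathbf{B}^{\mathrm{M1}}_{\gamma}(i, l) = \gamma\, 1[l = y_i]$, and $\mathcal{C}^{\mathrm{M1}} \subseteq \mathbb{R}^{m \times k}$ consists of matrices satisfying the
constraints in (6).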

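AdaBoost.MH. AdaBoost.MH (Schapire and Singer, 1999) is a popular multiclass boosting
algorithm that is based on the one-against-all reduction, and was originally designed to
use weak-hypotheses that return a prediction for every example and every label. The implicit
weak learning condition requires that for any matrix with non-negative entries $d(i, l)$,
the weak-hypothesis should achieve $1/2 + \gamma$ accuracy

\[ \sum_{i=1}^{m} \left\{ 1[h(x_i) \ne y_i]\, d(i, y_i) + \sum_{l \ne y_i} 1[h(x_i) = l]\, d(i, l) \right\} \le \left( \frac{1}{2} - \frac{\gamma}{2} \right) \sum_{i=1}^{m} \sum_{l=1}^{k} d(i, l). \tag{8} \]

This can be rewritten as

\[ \sum_{i=1}^{m} \left\{ -1[h(x_i) = y_i]\, d(i, y_i) + \sum_{l \ne y_i} 1[h(x_i) = l]\, d(i, l) \right\} \le \sum_{i=1}^{m} \left\{ \left( \frac{1}{2} - \frac{\gamma}{2} \right) \sum_{l \ne y_i} d(i, l) - \left( \frac{1}{2} + \frac{\gamma}{2} \right) d(i, y_i) \right\}. \]

Using the mapping

\[ \mathbf{C}(i, l) = \begin{cases} d(i, l) & \text{if } l \ne y_i, \\ -d(i, l) & \text{if } l = y_i, \end{cases} \]

their weak-learning condition may be rewritten as follows

\[ \forall \mathbf{C} \in \mathbb{R}^{m \times k} \text{ satisfying } \mathbf{C}(i, y_i) \le 0,\ \mathbf{C}(i, l) \ge 0 \text{ for } l \ne y_i, \tag{9} \]

\[ \exists h \in \mathcal{H} : \sum_{i=1}^{m} \mathbf{C}(i, h(x_i)) \le \sum_{i=1}^{m} \left\{ \left( \frac{1}{2} + \frac{\gamma}{2} \right) \mathbf{C}(i, y_i) + \left( \frac{1}{2} - \frac{\gamma}{2} \right) \sum_{l \ne y_i} \mathbf{C}(i, l) \right\}. \tag{10} \]

Defining $\mathcal{C}^{\mathrm{MH}}$ to be the space of all cost matrices satisfying the constraints in (9), the above
condition is the same as

\[ \forall \mathbf{C} \in \mathcal{C}^{\mathrm{MH}}, \exists h \in \mathcal{H} : \mathbf{C} \bullet (\mathbf{1}_h - \mathbf{B}^{\mathrm{MH}}_{\gamma}) \le 0, \]

where $\mathbf{B}^{\mathrm{MH}}_{\gamma}(i, l) = (1/2 + \gamma\, 1[l = y_i]/2)$.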

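AdaBoost.MR. AdaBoost.MR (Schapire and Singer, 1999) is based on the all-pairs multiclass
to binary reduction. Like AdaBoost.MH, it was originally designed to use weak-hypotheses
that return a prediction for every example and every label. The weak learning
condition for AdaBoost.MR requires that for any non-negative cost-vectors $\{d(i, l)\}_{l \ne y_i}$, the
weak-hypothesis returned should satisfy the following:

\[ \sum_{i=1}^{m} \sum_{l \ne y_i} \left( 1[h(x_i) = l] - 1[h(x_i) = y_i] \right) d(i, l) \le -\gamma \sum_{i=1}^{m} \sum_{l \ne y_i} d(i, l), \]

\[ \text{i.e.} \quad \sum_{i=1}^{m} \left\{ -1[h(x_i) = y_i] \sum_{l \ne y_i} d(i, l) + \sum_{l \ne y_i} 1[h(x_i) = l]\, d(i, l) \right\} \le -\gamma \sum_{i=1}^{m} \sum_{l \ne y_i} d(i, l). \]

Substituting

\[ \mathbf{C}(i, l) = \begin{cases} d(i, l) & \text{if } l \ne y_i, \\ -\sum_{l' \ne y_i} d(i, l') & \text{if } l = y_i, \end{cases} \]

we may rewrite AdaBoost.MR's weak-learning condition as

\[ \forall \mathbf{C} \in \mathbb{R}^{m \times k} \text{ satisfying } \mathbf{C}(i, l) \ge 0 \text{ for } l \ne y_i,\ \mathbf{C}(i, y_i) = -\sum_{l \ne y_i} \mathbf{C}(i, l), \tag{11} \]

\[ \exists h \in \mathcal{H} : \sum_{i=1}^{m} \mathbf{C}(i, h(x_i)) \le -\frac{\gamma}{2} \sum_{i=1}^{m} \left\{ -\mathbf{C}(i, y_i) + \sum_{l \ne y_i} \mathbf{C}(i, l) \right\}. \]

Defining $\mathcal{C}^{\mathrm{MR}}$ to be the collection of cost matrices satisfying the constraints in (11), the
above condition is the same as

\[ \forall \mathbf{C} \in \mathcal{C}^{\mathrm{MR}}, \exists h \in \mathcal{H} : \mathbf{C} \bullet (\mathbf{1}_h - \mathbf{B}^{\mathrm{MR}}_{\gamma}) \le 0, \]

where $\mathbf{B}^{\mathrm{MR}}_{\gamma}(i, l) = [\![ l = y_i ]\!]\, \gamma/2$.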
3.2 A curious equivalence

We show that the weak learning conditions of AdaBoost.MH and AdaBoost.M1 are identical
in our framework. This is surprising because the original motivations behind these algorithms
were completely different. AdaBoost.M1 is a direct extension of binary AdaBoost
to the multiclass setting, whereas AdaBoost.MH is based on the one-against-all multiclass 
to binary reduction. This equivalence is a sort of degeneracy, and arises because the weak 
classifiers being used predict single labels per example. With multilabel weak classifiers, for 
which AdaBoost.MH was originally designed, the equivalence no longer holds. 

The proofs in this and later sections will make use of the following minimax result, that 
is a weaker version of Corollary 37.3.2 of (Rockafellar, 1970). 

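Theorem 1 (Minimax Theorem) Let $C, D$ be non-empty closed convex subsets of $\mathbb{R}^m, \mathbb{R}^n$
respectively, and let $K$ be a linear function on $C \times D$. If either $C$ or $D$ is bounded, then

\[ \min_{\mathbf{v} \in D} \max_{\mathbf{u} \in C} K(\mathbf{u}, \mathbf{v}) = \max_{\mathbf{u} \in C} \min_{\mathbf{v} \in D} K(\mathbf{u}, \mathbf{v}). \]

Lemma 2 A weak classifier space $\mathcal{H}$ satisfies $(\mathcal{C}^{\mathrm{M1}}, \mathbf{B}^{\mathrm{M1}}_{\gamma})$ if and only if it satisfies $(\mathcal{C}^{\mathrm{MH}}, \mathbf{B}^{\mathrm{MH}}_{\gamma})$.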

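Proof We will refer to $(\mathcal{C}^{\mathrm{M1}}, \mathbf{B}^{\mathrm{M1}}_{\gamma})$ by M1 and $(\mathcal{C}^{\mathrm{MH}}, \mathbf{B}^{\mathrm{MH}}_{\gamma})$ by MH for brevity. The proof
is in three steps.

Step (i): If $\mathcal{H}$ satisfies MH, then it also satisfies M1. This follows since any constraint
(4) imposed by M1 on $\mathcal{H}$ can be reproduced by MH by plugging the following values of
$d(i, l)$ in (8):

\[ d(i, l) = \begin{cases} d(i) & \text{if } l = y_i, \\ 0 & \text{if } l \ne y_i. \end{cases} \]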

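Step (ii): If $\mathcal{H}$ satisfies M1, then there is a convex combination $\mathbf{H}_{\lambda^*}$ of the matrices
$\mathbf{1}_h$, $h \in \mathcal{H}$, defined as

\[ \mathbf{H}_{\lambda^*} = \sum_{h \in \mathcal{H}} \lambda^*(h)\, \mathbf{1}_h, \]

such that

\[ \forall i : \left( \mathbf{H}_{\lambda^*} - \mathbf{B}^{\mathrm{MH}}_{\gamma} \right)(i, l) \begin{cases} \ge 0 & \text{if } l = y_i, \\ \le 0 & \text{if } l \ne y_i. \end{cases} \tag{12} \]

Indeed, Theorem 1 yields

\[ \min_{\lambda \in \Delta(\mathcal{H})} \max_{\mathbf{C} \in \mathcal{C}^{\mathrm{M1}}} \mathbf{C} \bullet \left( \mathbf{H}_{\lambda} - \mathbf{B}^{\mathrm{M1}}_{\gamma} \right) = \max_{\mathbf{C} \in \mathcal{C}^{\mathrm{M1}}} \min_{h \in \mathcal{H}} \mathbf{C} \bullet \left( \mathbf{1}_h - \mathbf{B}^{\mathrm{M1}}_{\gamma} \right) \le 0, \tag{13} \]

where the inequality is a restatement of our assumption that $\mathcal{H}$ satisfies M1. If $\lambda^*$ is a
minimizer of the minmax expression, then $\mathbf{H}_{\lambda^*}$ must satisfy

\[ \forall i : \mathbf{H}_{\lambda^*}(i, l) \begin{cases} \ge \frac{1}{2} + \frac{\gamma}{2} & \text{if } l = y_i, \\ \le \frac{1}{2} - \frac{\gamma}{2} & \text{if } l \ne y_i, \end{cases} \tag{14} \]

or else some choice of $\mathbf{C} \in \mathcal{C}^{\mathrm{M1}}$ can cause $\mathbf{C} \bullet (\mathbf{H}_{\lambda^*} - \mathbf{B}^{\mathrm{M1}}_{\gamma})$ to exceed 0. In particular, if
$\mathbf{H}_{\lambda^*}(i_0, y_{i_0}) < 1/2 + \gamma/2$, then

\[ \left( \mathbf{H}_{\lambda^*} - \mathbf{B}^{\mathrm{M1}}_{\gamma} \right)(i_0, y_{i_0}) < \sum_{l \ne y_{i_0}} \left( \mathbf{H}_{\lambda^*} - \mathbf{B}^{\mathrm{M1}}_{\gamma} \right)(i_0, l). \]

Now, if we choose $\mathbf{C} \in \mathcal{C}^{\mathrm{M1}}$ as

\[ \mathbf{C}(i, l) = \begin{cases} 0 & \text{if } i \ne i_0, \\ 1 & \text{if } i = i_0,\ l \ne y_{i_0}, \\ -1 & \text{if } i = i_0,\ l = y_{i_0}, \end{cases} \]

then

\[ \mathbf{C} \bullet \left( \mathbf{H}_{\lambda^*} - \mathbf{B}^{\mathrm{M1}}_{\gamma} \right) = -\left( \mathbf{H}_{\lambda^*} - \mathbf{B}^{\mathrm{M1}}_{\gamma} \right)(i_0, y_{i_0}) + \sum_{l \ne y_{i_0}} \left( \mathbf{H}_{\lambda^*} - \mathbf{B}^{\mathrm{M1}}_{\gamma} \right)(i_0, l) > 0, \]

contradicting the inequality in (13). Therefore (14) holds. Eqn. (12), and thus Step (ii),
now follows by observing that $\mathbf{B}^{\mathrm{MH}}_{\gamma}$, by definition, satisfies

\[ \forall i : \mathbf{B}^{\mathrm{MH}}_{\gamma}(i, l) = \begin{cases} \frac{1}{2} + \frac{\gamma}{2} & \text{if } l = y_i, \\ \frac{1}{2} & \text{if } l \ne y_i. \end{cases} \]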

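Step (iii): If there is some convex combination $\mathbf{H}_{\lambda^*}$ satisfying (12), then $\mathcal{H}$ satisfies
MH. Recall that matrices in $\mathcal{C}^{\mathrm{MH}}$ consist of entries that are non-positive on the correct labels and
non-negative for incorrect labels. Therefore, (12) implies

\[ 0 \ge \max_{\mathbf{C} \in \mathcal{C}^{\mathrm{MH}}} \mathbf{C} \bullet \left( \mathbf{H}_{\lambda^*} - \mathbf{B}^{\mathrm{MH}}_{\gamma} \right) \ge \min_{\lambda \in \Delta(\mathcal{H})} \max_{\mathbf{C} \in \mathcal{C}^{\mathrm{MH}}} \mathbf{C} \bullet \left( \mathbf{H}_{\lambda} - \mathbf{B}^{\mathrm{MH}}_{\gamma} \right). \]

On the other hand, using Theorem 1 we have

\[ \min_{\lambda \in \Delta(\mathcal{H})} \max_{\mathbf{C} \in \mathcal{C}^{\mathrm{MH}}} \mathbf{C} \bullet \left( \mathbf{H}_{\lambda} - \mathbf{B}^{\mathrm{MH}}_{\gamma} \right) = \max_{\mathbf{C} \in \mathcal{C}^{\mathrm{MH}}} \min_{h \in \mathcal{H}} \mathbf{C} \bullet \left( \mathbf{1}_h - \mathbf{B}^{\mathrm{MH}}_{\gamma} \right). \]

Combining the two, we get

\[ 0 \ge \max_{\mathbf{C} \in \mathcal{C}^{\mathrm{MH}}} \min_{h \in \mathcal{H}} \mathbf{C} \bullet \left( \mathbf{1}_h - \mathbf{B}^{\mathrm{MH}}_{\gamma} \right), \]

which is the same as saying that $\mathcal{H}$ satisfies MH's condition.

Steps (ii) and (iii) together imply that if $\mathcal{H}$ satisfies M1, then it also satisfies MH. Along
with Step (i), this concludes the proof. ■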



4. Necessary and sufficient weak-learning conditions

The binary weak-learning condition has an appealing form: for any distribution over the 
examples, the weak classifier needs to achieve error not greater than that of a random player 
who guesses the correct answer with probability $1/2 + \gamma/2$. Further, this is the weakest condition
under which boosting is possible, as follows from a game-theoretic perspective (Freund
and Schapire, 1996b; Rätsch and Warmuth, 2005). Multiclass weak-learning conditions with
similar properties are missing in the literature. In this section we show how our framework 
captures such conditions. 
4.1 Edge-over-random conditions

In the multiclass setting, we model a random player as a baseline predictor $\mathbf{B} \in \mathbb{R}^{m \times k}$ whose
rows are distributions over the labels, $\mathbf{B}(i) \in \Delta\{1, \ldots, k\}$. The prediction on example $i$ is a
sample from $\mathbf{B}(i)$. We only consider the space of edge-over-random baselines $\mathcal{B}^{\mathrm{eor}}_{\gamma} \subseteq \mathbb{R}^{m \times k}$
who have a faint clue about the correct answer. More precisely, any baseline $\mathbf{B} \in \mathcal{B}^{\mathrm{eor}}_{\gamma}$
in this space is $\gamma$ more likely to predict the correct label than an incorrect one on every
example $i$: $\forall l \ne 1, \mathbf{B}(i, 1) \ge \mathbf{B}(i, l) + \gamma$, with equality holding for some $l$, i.e.:

\[ \mathbf{B}(i, 1) = \max \{ \mathbf{B}(i, l) + \gamma : l \ne 1 \}. \]

Notice that the edge-over-random baselines are different from the baselines used by earlier 
weak learning conditions discussed in the previous section. 

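When $k = 2$, the space $\mathcal{B}^{\mathrm{eor}}_{\gamma}$ consists of the unique player $\mathbf{U}^{\mathrm{bin}}_{\gamma}$, and the binary weak-learning
condition is given by $(\mathcal{C}^{\mathrm{bin}}, \mathbf{U}^{\mathrm{bin}}_{\gamma})$. The new conditions generalize this to $k > 2$. In
particular, define $\mathcal{C}^{\mathrm{eor}}$ to be the multiclass extension of $\mathcal{C}^{\mathrm{bin}}$: any cost-matrix in $\mathcal{C}^{\mathrm{eor}}$ should
put the least cost on the correct label, i.e., the rows of the cost-matrices should come
from the set $\{ \mathbf{c} \in \mathbb{R}^k : \forall l, \mathbf{c}(1) \le \mathbf{c}(l) \}$. Then, for every baseline $\mathbf{B} \in \mathcal{B}^{\mathrm{eor}}_{\gamma}$, we introduce
the condition $(\mathcal{C}^{\mathrm{eor}}, \mathbf{B})$, which we call an edge-over-random weak-learning condition. Since
$\mathbf{C} \bullet \mathbf{B}$ is the expected cost of the edge-over-random baseline $\mathbf{B}$ on matrix $\mathbf{C}$, the constraints
(2) imposed by the new condition essentially require better than random performance.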
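The definition of an edge-over-random baseline can be illustrated with a small Python sketch. It builds the near-uniform baseline from Section 3.1 (weight (1 - γ)/k + γ on the correct label, (1 - γ)/k elsewhere) and tests the membership condition stated above, with the correct label placed in the first column as the text assumes. The function names and the numerical tolerance `tol` are hypothetical choices made for this illustration.

```python
def uniform_edge_baseline(m, k, gamma):
    """m x k baseline with (1 - gamma)/k + gamma on label 1 and (1 - gamma)/k elsewhere."""
    low = (1.0 - gamma) / k
    return [[low + gamma] + [low] * (k - 1) for _ in range(m)]

def is_edge_over_random(B, gamma, tol=1e-9):
    """Check that every row is a distribution with B(i,1) = max_{l != 1} B(i,l) + gamma.

    The tolerance only absorbs floating-point error; it is not part of the definition.
    """
    for row in B:
        if any(p < -tol for p in row) or abs(sum(row) - 1.0) > tol:
            return False  # the row is not a probability distribution
        if abs(row[0] - (max(row[1:]) + gamma)) > tol:
            return False  # the edge over the best incorrect label is not exactly gamma
    return True
```

A row such as (0.5, 0.4, 0.1) is also an edge-over-random baseline row for γ = 0.1, which shows that the family contains many baselines besides the near-uniform one.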

Also recall that we have assumed that the true label yi of example i in our training set 
is always 1. Nevertheless, we may occasionally continue to refer to the true labels as yi. 

We now present the central results of this section. The seemingly mild edge-over-random 
conditions guarantee boostability, meaning weak classifiers that satisfy any one such condition
can be combined to form a highly accurate combined classifier.

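Theorem 3 (Sufficiency) If a weak classifier space $\mathcal{H}$ satisfies a weak-learning condition
$(\mathcal{C}^{\mathrm{eor}}, \mathbf{B})$, for some $\mathbf{B} \in \mathcal{B}^{\mathrm{eor}}_{\gamma}$, then $\mathcal{H}$ is boostable.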

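Proof The proof is in the spirit of the ones in (Freund and Schapire, 1996b). Applying
Theorem 1 yields

\[ 0 \ge \max_{\mathbf{C} \in \mathcal{C}^{\mathrm{eor}}} \min_{h \in \mathcal{H}} \mathbf{C} \bullet (\mathbf{1}_h - \mathbf{B}) = \min_{\lambda \in \Delta(\mathcal{H})} \max_{\mathbf{C} \in \mathcal{C}^{\mathrm{eor}}} \mathbf{C} \bullet (\mathbf{H}_{\lambda} - \mathbf{B}), \]

where the first inequality follows from the definition (2) of the weak-learning condition. Let
$\lambda^*$ be a minimizer of the min-max expression. Unless the first entry of each row of $(\mathbf{H}_{\lambda^*} - \mathbf{B})$
is the largest, the right hand side of the min-max expression can be made arbitrarily large
by choosing $\mathbf{C} \in \mathcal{C}^{\mathrm{eor}}$ appropriately. For example, if in some row $i$, the $j_0$-th element is strictly
larger than the first element, by choosing $\mathbf{C}$ to be zero outside row $i$ and setting

\[ \mathbf{C}(i, j) = \begin{cases} -1 & \text{if } j = 1, \\ 1 & \text{if } j = j_0, \\ 0 & \text{otherwise,} \end{cases} \]

we get a matrix in $\mathcal{C}^{\mathrm{eor}}$ which causes $\mathbf{C} \bullet (\mathbf{H}_{\lambda^*} - \mathbf{B})$ to be equal to
$(\mathbf{H}_{\lambda^*} - \mathbf{B})(i, j_0) - (\mathbf{H}_{\lambda^*} - \mathbf{B})(i, 1) > 0$, an impossibility by the first inequality.
Therefore, the convex combination of the weak classifiers, obtained by choosing each
weak classifier with weight given by $\lambda^*$, perfectly classifies the training data, in fact with a
margin $\gamma$. ■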
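The key step of the proof, exhibiting a cost matrix that witnesses a violation whenever some row of the difference between the averaged prediction matrix and the baseline does not have its largest entry in the first column, can be sketched as follows. The function name `violating_cost_matrix` and the list-of-rows matrix representation are assumptions made for illustration; the input `D` stands for that difference matrix.

```python
def violating_cost_matrix(D):
    """Given D (m x k), return a cost matrix C with C . D > 0, or None.

    C is zero except in one row, where it has -1 on label 1 and +1 on a label j0
    whose entry in D strictly exceeds the entry for label 1. Such a C puts the
    least cost on label 1 in every row, as the edge-over-random cost set requires.
    """
    k = len(D[0])
    for i, row in enumerate(D):
        j0 = max(range(k), key=lambda j: row[j])  # column of the largest entry
        if row[j0] > row[0]:
            C = [[0.0] * k for _ in D]
            C[i][0], C[i][j0] = -1.0, 1.0
            return C
    return None  # the first entry is already the largest in every row
```

When the function returns None, every row of `D` has its largest entry in the first column, which is the situation the proof concludes must hold for the minimizing combination.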

On the other hand, the family of such conditions, taken as a whole, is necessary for boostability
in the sense that every eligible space of weak classifiers satisfies some edge-over-random
condition. 

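Theorem 4 (Relaxed necessity) For every boostable weak classifier space $\mathcal{H}$, there exists
a $\gamma > 0$ and $B \in \mathcal{B}^{eor}_\gamma$ such that $\mathcal{H}$ satisfies the weak-learning condition $(\mathcal{C}^{eor}, B)$.

Proof The proof shows existence through non-constructive averaging arguments. We will
reuse notation from the proof of Theorem 3 above. $\mathcal{H}$ is boostable implies there exists some
distribution $\lambda^* \in \Delta(\mathcal{H})$ such that

$$\forall j \ne 1,\ i : \mathbf{H}_{\lambda^*}(i, 1) - \mathbf{H}_{\lambda^*}(i, j) > 0.$$

Let $\gamma > 0$ be the minimum of the above expression over all possible $(i, j)$, and let $B = \mathbf{H}_{\lambda^*}$.
Then $B \in \mathcal{B}^{eor}_\gamma$, and

$$\max_{C \in \mathcal{C}^{eor}} \min_{h \in \mathcal{H}} C \bullet (\mathbf{1}_h - B) \le \min_{\lambda \in \Delta(\mathcal{H})} \max_{C \in \mathcal{C}^{eor}} C \bullet (\mathbf{H}_\lambda - B) \le \max_{C \in \mathcal{C}^{eor}} C \bullet (\mathbf{H}_{\lambda^*} - B) = 0,$$

where the equality follows since by definition $\mathbf{H}_{\lambda^*} - B = 0$. That the max-min expression is at
most zero is another way of saying that $\mathcal{H}$ satisfies the weak-learning condition $(\mathcal{C}^{eor}, B)$ as
in (2). ■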
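The construction used in this proof can be illustrated on a toy example. The sketch below is only an illustration under assumptions that are not part of the original text: the predictions, the weights `lam`, and the helper names `vote_matrix` and `edge` are hypothetical, and labels are 0-indexed so that column 0 plays the role of the correct label 1.

```python
from fractions import Fraction

def vote_matrix(predictions, lam, k):
    """Return H_lambda: entry (i, l) is the total weight of classifiers predicting l on example i."""
    m = len(predictions[0])
    H = [[Fraction(0)] * k for _ in range(m)]
    for preds, weight in zip(predictions, lam):
        for i, label in enumerate(preds):
            H[i][label] += weight
    return H

def edge(H):
    """Smallest gap between the correct label (column 0) and any other label."""
    return min(row[0] - row[j] for row in H for j in range(1, len(row)))

# Hypothetical toy space: 3 classifiers, 3 examples, k = 3 labels, true label always 0.
predictions = [
    [0, 0, 1],   # classifier 1 errs on the third example
    [0, 2, 0],   # classifier 2 errs on the second example
    [1, 0, 0],   # classifier 3 errs on the first example
]
lam = [Fraction(1, 3)] * 3          # uniform convex combination
H = vote_matrix(predictions, lam, k=3)
gamma = edge(H)                     # the margin used as the edge in the proof
B = H                               # the baseline chosen in the proof: B = H_lambda
```

Here every row of `B` is a distribution over labels that favours the correct label by at least `gamma`, which is the property the proof relies on.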

Theorem 4 states that any boostable weak classifier space will satisfy some condition in
our family, but it does not help us choose the right condition. Experiments in Section 10
suggest $(\mathcal{C}^{eor}, U_\gamma)$ is effective with very simple weak-learners compared to popular boosting
algorithms. (Recall $U_\gamma \in \mathcal{B}^{eor}_\gamma$ is the edge-over-random baseline closest to uniform; it has
weight $(1 - \gamma)/k$ on incorrect labels and $(1 - \gamma)/k + \gamma$ on the correct label.) However, there
are theoretical examples showing each condition in our family is too strong.

Theorem 5 For any $B \in \mathcal{B}^{eor}_\gamma$, there exists a boostable space $\mathcal{H}$ that fails to satisfy the
condition $(\mathcal{C}^{eor}, B)$.

Proof We provide, for any $\gamma > 0$ and edge-over-random baseline $B \in \mathcal{B}^{eor}_\gamma$, a dataset and
weak classifier space that is boostable but fails to satisfy the condition $(\mathcal{C}^{eor}, B)$.

Pick $\gamma' = \gamma/k$ and set $m > 1/\gamma'$ so that $\lfloor m(1/2 + \gamma') \rfloor > m/2$. Our dataset will
have $m$ labeled examples $\{(0, y_0), \ldots, (m - 1, y_{m-1})\}$, and $m$ weak classifiers. We want the
following symmetries in our weak classifiers:

• Each weak classifier correctly classifies $\lfloor m(1/2 + \gamma') \rfloor$ examples and misclassifies the
rest.

• On each example, $\lfloor m(1/2 + \gamma') \rfloor$ weak classifiers predict correctly.

Note the second property implies boostability, since the uniform convex combination of all
the weak classifiers is a perfect predictor.

The two properties can be satisfied by the following design. A window is a contiguous
sequence of examples that may wrap around; for example

$$\{i,\ (i + 1) \bmod m,\ \ldots,\ (i + k) \bmod m\}$$

is a window containing $k$ elements, which may wrap around if $i + k \ge m$. For each window
of length $\lfloor m(1/2 + \gamma') \rfloor$ create a hypothesis that correctly classifies within the window,
and misclassifies outside. This weak-hypothesis space has size $m$, and has the required
properties.

We still have flexibility as to how the misclassifications occur, and which cost-matrix to 
use, which brings us to the next two choices: 

• Whenever a hypothesis misclassifies on example $i$, it predicts label

$$\hat{y}_i = \operatorname{argmin} \{B(i, l) : l \ne y_i\}. \qquad (15)$$

• A cost-matrix is chosen so that the cost of predicting $\hat{y}_i$ on example $i$ is 1, but for any
other prediction the cost is zero. Observe this cost-matrix belongs to $\mathcal{C}^{eor}$.

Therefore, every time a weak classifier predicts incorrectly, it also suffers cost 1. Since each
weak classifier predicts correctly only within a window of length $\lfloor m(1/2 + \gamma') \rfloor$, it suffers
cost $\lceil m(1/2 - \gamma') \rceil$. On the other hand, by the choice of $\hat{y}_i$ in (15),

$$\begin{aligned}
B(i, \hat{y}_i) &= \min \{B(i, 1) - \gamma,\ B(i, 2),\ \ldots,\ B(i, k)\} \\
&\le \frac{1}{k} \{B(i, 1) - \gamma + B(i, 2) + B(i, 3) + \ldots + B(i, k)\} \\
&= 1/k - \gamma/k.
\end{aligned}$$

So the cost of $B$ on the chosen cost-matrix is at most $m(1/k - \gamma/k)$, which is less than the
cost $\lceil m(1/2 - \gamma') \rceil \ge m(1/2 - \gamma/k)$ of any weak classifier whenever the number of labels $k$
is more than two. Hence our boostable space of weak classifiers fails to satisfy $(\mathcal{C}^{eor}, B)$. ■
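The window construction in this proof is easy to check numerically. The following is a minimal sketch, assuming concrete values $k = 3$, $\gamma = 3/10$ and $m = 20$ that are chosen here purely for illustration; the helper name `window_space` is hypothetical, and exact rational arithmetic is used to avoid rounding issues in the floor.

```python
from fractions import Fraction
from math import floor

def window_space(m, w):
    """correct[h][i] is True when classifier h is right on example i (window of length w starting at h)."""
    return [[(i - h) % m < w for i in range(m)] for h in range(m)]

k = 3
gamma = Fraction(3, 10)
gamma_prime = gamma / k                           # gamma' = gamma / k
m = 20                                            # any m > 1 / gamma' works
w = floor(m * (Fraction(1, 2) + gamma_prime))     # window length

correct = window_space(m, w)
per_classifier = [sum(row) for row in correct]                             # examples each classifier gets right
per_example = [sum(correct[h][i] for h in range(m)) for i in range(m)]     # classifiers right on each example

classifier_cost = m - w                           # every mistake costs exactly 1
baseline_cost_bound = m * (1 - gamma) / k         # the bound m(1/k - gamma/k) on the cost of B
```

With these values each classifier and each example sees the same window length, and the baseline bound falls strictly below the cost of every weak classifier, as the argument requires for $k > 2$.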

Theorems 4 and 5 can be interpreted as follows. While a boostable space will satisfy some 
edge-over-random condition, without further information about the dataset it is not possible 
to know which particular condition will be satisfied. The kind of prior knowledge required 
to make this guess correctly is provided by Theorem 3: the appropriate weak learning 
condition is determined by the distribution of votes on the labels for each example that a 
target weak classifier combination might be able to get. Even with domain expertise, such 
knowledge may or may not be obtainable in practice before running boosting. We therefore 
need conditions that assume less. 



4.2 The minimal weak learning condition 

A perhaps extreme way of weakening the condition is by requiring the performance on a
cost matrix to be competitive not with a fixed baseline $B \in \mathcal{B}^{eor}_\gamma$, but with the worst of
them:

$$\forall C \in \mathcal{C}^{eor},\ \exists h \in \mathcal{H} : C \bullet \mathbf{1}_h \le \max_{B \in \mathcal{B}^{eor}_\gamma} C \bullet B. \qquad (16)$$

Condition (16) states that during the course of the same boosting game, Weak-Learner may
choose to beat any edge-over-random baseline $B \in \mathcal{B}^{eor}_\gamma$, possibly a different one for every
round and every cost-matrix. This may superficially seem much too weak. On the contrary,
this condition turns out to be equivalent to boostability. In other words, according to our
criterion, it is neither too weak nor too strong as a weak-learning condition. However,
unlike the edge-over-random conditions, it also turns out to be more difficult to work with
algorithmically.

Furthermore, this condition can be shown to be equivalent to the one used by
AdaBoost.MR (Schapire and Singer, 1999; Freund and Schapire, 1996a). This is perhaps
remarkable since the latter is based on the apparently completely unrelated all-pairs multiclass
to binary reduction. In Section 3 we saw that the MR condition is given by $(\mathcal{C}^{MR}, B^{MR}_\gamma)$,
where $\mathcal{C}^{MR}$ consists of cost-matrices that put non-negative costs on incorrect labels and
whose rows sum up to zero, while $B^{MR}_\gamma \in \mathbb{R}^{m \times k}$ is the matrix that has $\gamma$ on the first column
and $-\gamma$ on all other columns. Further, the MR condition, and hence (16), can be shown to
be neither too weak nor too strong.

Theorem 6 (MR) A weak classifier space $\mathcal{H}$ satisfies AdaBoost.MR's weak-learning
condition $(\mathcal{C}^{MR}, B^{MR}_\gamma)$ if and only if it satisfies (16). Moreover, this condition is equivalent to
being boostable.

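Proof We will show the following three conditions are equivalent:

(A) $\mathcal{H}$ is boostable

(B) $\exists \gamma > 0$ such that $\forall C \in \mathcal{C}^{eor},\ \exists h \in \mathcal{H} : C \bullet \mathbf{1}_h \le \max_{B \in \mathcal{B}^{eor}_\gamma} C \bullet B$

(C) $\exists \gamma > 0$ such that $\forall C \in \mathcal{C}^{MR},\ \exists h \in \mathcal{H} : C \bullet \mathbf{1}_h \le C \bullet B^{MR}_\gamma$.

We will show (A) implies (B), (B) implies (C), and (C) implies (A) to achieve the above.

(A) implies (B): Immediate from Theorem 2.

(B) implies (C): Suppose (B) is satisfied with $2\gamma$. We will show that this implies $\mathcal{H}$
satisfies $(\mathcal{C}^{MR}, B^{MR}_\gamma)$. Notice $\mathcal{C}^{MR} \subset \mathcal{C}^{eor}$. Therefore it suffices to show that

$$\forall C \in \mathcal{C}^{MR},\ B \in \mathcal{B}^{eor}_{2\gamma} : C \bullet (B - B^{MR}_\gamma) \le 0.$$

Notice that $B \in \mathcal{B}^{eor}_{2\gamma}$ implies $B' = B - B^{MR}_\gamma$ is a matrix whose largest entry in each row is
in the first column of that row. Then, for any $C \in \mathcal{C}^{MR}$, $C \bullet B'$ can be written as

$$C \bullet B' = \sum_{i=1}^{m} \sum_{j=2}^{k} C(i, j) \left( B'(i, j) - B'(i, 1) \right).$$

Since $C(i, j) \ge 0$ for $j > 1$, and $B'(i, j) - B'(i, 1) \le 0$, we have our result.

(C) implies (A): Applying Theorem 1

$$0 \ge \max_{C \in \mathcal{C}^{MR}} \min_{h \in \mathcal{H}} C \bullet (\mathbf{1}_h - B^{MR}_\gamma) = \min_{\lambda \in \Delta(\mathcal{H})} \max_{C \in \mathcal{C}^{MR}} C \bullet (\mathbf{H}_\lambda - B^{MR}_\gamma).$$

For any $i_0$ and $l_0 \ne 1$ the following cost-matrix $C$ satisfies $C \in \mathcal{C}^{MR}$,

$$C(i, l) = \begin{cases} 0 & \text{if } i \ne i_0 \text{ or } l \notin \{1, l_0\}, \\ 1 & \text{if } i = i_0,\ l = l_0, \\ -1 & \text{if } i = i_0,\ l = 1. \end{cases}$$

Let $\lambda$ belong to the argmin of the minmax expression. Then $C \bullet (\mathbf{H}_\lambda - B^{MR}_\gamma) \le 0$ implies
$\mathbf{H}_\lambda(i_0, 1) - \mathbf{H}_\lambda(i_0, l_0) \ge 2\gamma$. Since this is true for all $i_0$ and $l_0 \ne 1$, we conclude that the
$(\mathcal{C}^{MR}, B^{MR}_\gamma)$ condition implies boostability.

This concludes the proof of equivalence. ■

          h_1    h_2
    a      1      2
    b      1      2

Figure 1: A weak classifier space which satisfies SAMME's weak learning condition but is not
boostable.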

Next, we illustrate the strengths of our minimal weak-learning condition through concrete
comparisons with previous algorithms.

Comparison with SAMME. The SAMME algorithm of Zhu et al. (2009) requires the
weak classifiers to achieve less error than uniform random guessing for multiple labels; in our
language, their weak-learning condition is $(\mathcal{C}^{SAM}, U_\gamma)$, as shown in Section 3, where $\mathcal{C}^{SAM}$
consists of cost matrices whose rows are of the form $(0, t, t, \ldots)$ for some non-negative $t$. As is
well-known, this condition is not sufficient for boosting to be possible. In particular, consider
the dataset $\{(a, 1), (b, 2)\}$ with $k = 3$, $m = 2$, and a weak classifier space consisting of $h_1, h_2$
which always predict 1, 2, respectively (Figure 1). Since neither classifier distinguishes
between $a$ and $b$, we cannot achieve perfect accuracy by combining them in any way. Yet, due
to the constraints on the cost-matrix, one of $h_1, h_2$ will always manage non-positive cost
while random always suffers positive cost. On the other hand our weak-learning condition
allows the Booster to choose far richer cost matrices. In particular, when the cost matrix
$C \in \mathcal{C}^{eor}$ is given by

           1     2     3
    a     -1    +1     0
    b     +1    -1     0,

both classifiers in the above example suffer more loss than the random player $U_\gamma$, and fail
to satisfy our condition.
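The comparison above can be checked by direct computation. The sketch below is illustrative only: labels are 0-indexed, the value of `gamma` is an arbitrary assumption, and the helper names are hypothetical. It evaluates the cost of $h_1$, $h_2$ and of the baseline $U_\gamma$ on the $2 \times 3$ cost matrix given in the text.

```python
from fractions import Fraction

k = 3
gamma = Fraction(1, 10)          # hypothetical edge; any 0 < gamma <= 1 gives the same conclusion
true_labels = [0, 1]             # examples a, b carry labels 1, 2 (0-indexed here)

# Cost matrix from the text: rows are examples a, b; columns are labels 1, 2, 3.
C = [[-1, +1, 0],
     [+1, -1, 0]]

def classifier_cost(C, predictions):
    """Total cost of a classifier that predicts predictions[i] on example i."""
    return sum(C[i][label] for i, label in enumerate(predictions))

def uniform_baseline_row(true_label, k, gamma):
    """Row of U_gamma: (1 - gamma)/k everywhere, plus gamma on the true label."""
    row = [(1 - gamma) / k] * k
    row[true_label] += gamma
    return row

def baseline_cost(C, true_labels, k, gamma):
    """Expected cost of the random player U_gamma on cost matrix C."""
    total = Fraction(0)
    for i, y in enumerate(true_labels):
        row = uniform_baseline_row(y, k, gamma)
        total += sum(c * b for c, b in zip(C[i], row))
    return total

cost_h1 = classifier_cost(C, [0, 0])   # h1 always predicts label 1
cost_h2 = classifier_cost(C, [1, 1])   # h2 always predicts label 2
cost_u = baseline_cost(C, true_labels, k, gamma)
```

Both classifiers incur total cost 0, while the baseline incurs $-2\gamma$, so neither classifier is competitive with $U_\gamma$ on this matrix.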

Comparison with AdaBoost.MH. AdaBoost.MH (Schapire and Singer, 1999) was
designed for use with weak hypotheses that on each example return a prediction for every
label. When used in our framework, where the weak classifiers return only a single
multiclass prediction per example, the implicit demands made by AdaBoost.MH on the weak
classifier space turn out to be too strong. To demonstrate this, we construct a classifier
space that satisfies the condition $(\mathcal{C}^{eor}, U_\gamma)$ in our family, but cannot satisfy AdaBoost.MH's
weak-learning condition. Note that this does not imply that the conditions are too strong
when used with more powerful weak classifiers that return multilabel multiclass predictions.

Consider a space $\mathcal{H}$ that has, for every $(1/k + \gamma)m$ element subset of the examples, a
classifier that predicts correctly on exactly those elements. The expected loss of a randomly
chosen classifier from this space is the same as that of the random player $U_\gamma$. Hence $\mathcal{H}$
satisfies this weak-learning condition. On the other hand, it was shown in Section 3 that
AdaBoost.MH's weak-learning condition is the pair $(\mathcal{C}^{MH}, B^{MH}_\gamma)$, where $\mathcal{C}^{MH}$ consists of cost
matrices with non-negative entries on incorrect labels and non-positive entries on real labels,
and where each row of the matrix $B^{MH}_\gamma$ is the vector $(1/2 + \gamma/2,\ 1/2 - \gamma/2,\ \ldots,\ 1/2 - \gamma/2)$.
A quick calculation shows that for any $h \in \mathcal{H}$, and $C \in \mathcal{C}^{MH}$ with $-1$ in the first column
and zeroes elsewhere, $C \bullet (\mathbf{1}_h - B^{MH}_\gamma) = 1/2 - 1/k$. This is positive when $k > 2$, so that $\mathcal{H}$
fails to satisfy AdaBoost.MH's condition.

We have seen how our framework allows us to capture the strengths and weaknesses of 
old conditions, describe a whole new family of conditions and also identify the condition 
making minimal assumptions. In the next few sections, we show how to design boosting 
algorithms that employ these new conditions and enjoy strong theoretical guarantees. 

5. Algorithms 

In this section we devise algorithms by analyzing the boosting games that employ
weak-learning conditions in our framework. We compute the optimum Booster strategy against
a completely adversarial Weak-Learner, which here is permitted to choose weak classifiers
without restriction, i.e. from the entire space $\mathcal{H}^{all}$ of all possible functions mapping examples
to labels. By modeling Weak-Learner adversarially, we make absolutely no assumptions
on the algorithm it might use. Hence, error guarantees enjoyed in this situation will be
universally applicable. Our algorithms are derived from the very general drifting games
framework (Schapire, 2001) for solving boosting games, which in turn was inspired by
Freund's Boost-by-majority algorithm (Freund, 1995), which we review next.

The OS Algorithm. Fix the number of rounds $T$ and a weak-learning condition $(\mathcal{C}, B)$.
We will only consider conditions that are not vacuous, i.e., at least some classifier space
satisfies the condition, or equivalently, the space $\mathcal{H}^{all}$ satisfies $(\mathcal{C}, B)$. Additionally, we
assume the constraints placed by $\mathcal{C}$ are on individual rows. In other words, there is some
subset $\mathcal{C}_0 \subseteq \mathbb{R}^k$ of all possible rows, such that a cost matrix $C$ belongs to the collection $\mathcal{C}$
if and only if each of its rows belongs to this subset:

$$C \in \mathcal{C} \iff \forall i : C(i) \in \mathcal{C}_0. \qquad (17)$$

Further, we assume $\mathcal{C}_0$ forms a convex cone, i.e. $c, c' \in \mathcal{C}_0$ implies $tc + t'c' \in \mathcal{C}_0$ for any
non-negative $t, t'$. This also implies that $\mathcal{C}$ is a convex cone. This is a very natural restriction,
and is satisfied by the space $\mathcal{C}$ used by the weak learning conditions of AdaBoost.MH,
AdaBoost.M1, AdaBoost.MR, SAMME as well as every edge-over-random condition.^1
For simplicity of presentation we fix the weights $\alpha_t = 1$ in each round. With $f_T$ defined
as in (1), whether the final hypothesis output by Booster makes a prediction error on an
example $i$ is decided by whether an incorrect label received the maximum number of votes,
$f_T(i, 1) \le \max_{l=2}^{k} f_T(i, l)$. Therefore, the optimum Booster payoff can be written as

$$\min_{C_1 \in \mathcal{C}} \; \max_{\substack{h_1 \in \mathcal{H}^{all}: \\ C_1 \bullet (\mathbf{1}_{h_1} - B) \le 0}} \cdots \; \min_{C_T \in \mathcal{C}} \; \max_{\substack{h_T \in \mathcal{H}^{all}: \\ C_T \bullet (\mathbf{1}_{h_T} - B) \le 0}} \; \frac{1}{m} \sum_{i=1}^{m} L^{err}(f_T(x_i, 1), \ldots, f_T(x_i, k)), \qquad (18)$$

where the function $L^{err} : \mathbb{R}^k \to \mathbb{R}$ encodes 0-1 error

$$L^{err}(s) = \mathbb{1}\left[ s(1) \le \max_{l > 1} s(l) \right]. \qquad (19)$$

In general, we will also consider other loss functions $L : \mathbb{R}^k \to \mathbb{R}$ such as exponential loss,
hinge loss, etc. that upper-bound error and are proper: i.e. $L(s)$ is decreasing in the weight
of the correct label $s(1)$, and increasing in the weights of the incorrect labels $s(l)$, $l \ne 1$.

1. All our results hold under the weaker restriction on the space $\mathcal{C}$, where the set of possible cost vectors $\mathcal{C}_0$
for a row $i$ could depend on $i$. For simplicity of exposition, we stick to the more restrictive assumption
that $\mathcal{C}_0$ is common across all rows.

I. MUKHERJEE AND R. E. SCHAPIRE

Directly analyzing the optimal payoff is hard. However, Schapire (2001) observed that
the payoffs can be very well approximated by certain potential functions. Indeed, for any
$b \in \mathbb{R}^k$ define the potential function $\phi_t^b : \mathbb{R}^k \to \mathbb{R}$ by the following recurrence:

$$\begin{aligned}
\phi_0^b &= L \\
\phi_t^b(s) &= \min_{c \in \mathcal{C}_0} \; \max_{p \in \Delta\{1, \ldots, k\}} \; \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) \right] \\
&\qquad \text{s.t. } \mathbb{E}_{l \sim p}[c(l)] \le \langle b, c \rangle,
\end{aligned} \qquad (20)$$

where $l \sim p$ denotes that label $l$ is sampled from the distribution $p$, and $e_l \in \mathbb{R}^k$ is the
unit-vector whose $l$th coordinate is 1 and the remaining coordinates zero. Notice the recurrence
uses the collection of rows $\mathcal{C}_0$ instead of the collection of cost matrices $\mathcal{C}$. When there
are $T - t$ rounds remaining (that is, after $t$ rounds of boosting), these potential functions
compute an estimate $\phi_{T-t}^b(s_t)$ of whether an example $x$ will be misclassified, based on its
current state $s_t$ consisting of counts of votes received so far on various classes:

$$s_t(l) = \sum_{t'=1}^{t-1} \mathbb{1}\left[ h_{t'}(x) = l \right]. \qquad (21)$$

Notice this definition of state assumes that $\alpha_t = 1$ in each round. Sometimes, we will choose
the weights differently. In such cases, a more appropriate definition is the weighted state
$f_t \in \mathbb{R}^k$, tracking the weighted counts of votes received so far:

$$f_t(l) = \sum_{t'=1}^{t-1} \alpha_{t'} \mathbb{1}\left[ h_{t'}(x) = l \right]. \qquad (22)$$

However, unless otherwise noted, we will assume $\alpha_t = 1$, and so the definition in (21) will
suffice.
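As a small illustration of the two notions of state in (21) and (22), the sketch below tallies the votes received by a single example. The function names, the 0-indexed labels and the sample prediction sequence are assumptions made for this example only.

```python
def unweighted_state(predictions, k):
    """s_t from (21): number of past rounds in which each label was predicted."""
    s = [0] * k
    for label in predictions:
        s[label] += 1
    return s

def weighted_state(predictions, alphas, k):
    """f_t from (22): the same tally, with round t' contributing weight alphas[t']."""
    f = [0.0] * k
    for label, alpha in zip(predictions, alphas):
        f[label] += alpha
    return f

# Hypothetical history for one example: labels predicted in four past rounds.
past_predictions = [0, 2, 0, 1]
s = unweighted_state(past_predictions, k=3)
f = weighted_state(past_predictions, alphas=[0.5] * 4, k=3)
```

When every round uses the same weight, the weighted state is simply that weight times the unweighted state, which is the correspondence used later for the exponential loss.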

The recurrence in (20) requires the max player's response $p$ to satisfy the constraint that
the expected cost under the distribution $p$ is at most the inner-product $\langle c, b \rangle$. If there is no
distribution satisfying this requirement, then the value of the max expression is $-\infty$. The
existence of a valid distribution depends on both $b$ and $c$ and is captured by the following:

$$\exists p \in \Delta\{1, \ldots, k\} : \mathbb{E}_{l \sim p}[c(l)] \le \langle c, b \rangle \iff \min_l c(l) \le \langle b, c \rangle. \qquad (23)$$

In this paper, the vector $b$ will always correspond to some row $B(i)$ of the baseline used in
the weak learning condition. In such a situation, the next lemma shows that a distribution
satisfying the required constraints will always exist.

Lemma 7 If $\mathcal{C}_0$ is a cone and (17) holds, then for any row $b = B(i)$ of the baseline and
any cost vector $c \in \mathcal{C}_0$, (23) holds unless the condition $(\mathcal{C}, B)$ is vacuous.

Proof We show that if (23) does not hold, then the condition is vacuous. Assume that for
row $b = B(i_0)$ of the baseline, and some choice of cost vector $c \in \mathcal{C}_0$, (23) does not hold.
We pick a cost-matrix $C \in \mathcal{C}$, such that no weak classifier $h$ can satisfy the requirement
(2), implying the condition must be vacuous. The $i_0$th row of the cost matrix is $c$, and the
remaining rows are $0$. Since $\mathcal{C}_0$ is a cone, $0 \in \mathcal{C}_0$ and hence the cost matrix lies in $\mathcal{C}$. With
this choice for $C$, the condition (2) becomes

$$c(h(x_{i_0})) = C(i_0, h(x_{i_0})) \le \langle C(i_0), B(i_0) \rangle = \langle c, b \rangle < \min_l c(l),$$

where the last inequality holds since, by assumption, (23) is not true for this choice of
$c, b$. The previous equation is an impossibility, and hence no such weak classifier $h$ exists,
showing the condition is vacuous. ■

Lemma 7 shows that the expression in (20) is well defined, and takes on finite values. We
next record an alternate dual form for the same recurrence which will be useful later.

Lemma 8 The recurrence in (20) is equivalent to

$$\phi_t^b(s) = \min_{c \in \mathcal{C}_0} \max_{l=1}^{k} \left\{ \phi_{t-1}^b(s + e_l) - \left( c(l) - \langle c, b \rangle \right) \right\}. \qquad (24)$$

Proof Using Lagrangean multipliers, we may convert (20) to an unconstrained expression
as follows:

$$\phi_t^b(s) = \min_{c \in \mathcal{C}_0} \; \max_{p \in \Delta\{1, \ldots, k\}} \; \min_{\lambda \ge 0} \left\{ \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) \right] - \lambda \left( \mathbb{E}_{l \sim p}[c(l)] - \langle c, b \rangle \right) \right\}.$$

Applying Theorem 1 to the inner min-max expression we get

$$\phi_t^b(s) = \min_{c \in \mathcal{C}_0} \; \min_{\lambda \ge 0} \; \max_{p \in \Delta\{1, \ldots, k\}} \left\{ \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) \right] - \left( \mathbb{E}_{l \sim p}[\lambda c(l)] - \langle \lambda c, b \rangle \right) \right\}.$$

Since $\mathcal{C}_0$ is a cone, $c \in \mathcal{C}_0$ implies $\lambda c \in \mathcal{C}_0$. Therefore we may absorb the Lagrange multiplier
into the cost vector:

$$\phi_t^b(s) = \min_{c \in \mathcal{C}_0} \; \max_{p \in \Delta\{1, \ldots, k\}} \; \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) - \left( c(l) - \langle c, b \rangle \right) \right].$$

For a fixed choice of $c$, the expectation is maximized when the distribution $p$ is concentrated
on a single label that maximizes the inner expression, which completes our proof. ■

The dual form of the recurrence is useful for optimally choosing the cost matrix in each
round. When the weak learning condition being used is $(\mathcal{C}, B)$, Schapire (2001) proposed
a Booster strategy, called the OS strategy, which always chooses the weight $\alpha_t = 1$, and
uses the potential functions to construct a cost matrix $C_t$ as follows. Each row $C_t(i)$ of
the matrix achieves the minimum of the right hand side of (24) with $b$ replaced by $B(i)$, $t$
replaced by $T - t$, and $s$ replaced by current state $s_t(i)$:

$$C_t(i) = \operatorname*{argmin}_{c \in \mathcal{C}_0} \max_{l=1}^{k} \left\{ \phi_{T-t-1}^{B(i)}(s + e_l) - \left( c(l) - \langle c, B(i) \rangle \right) \right\}. \qquad (25)$$

The following theorem, proved in the appendix, provides a guarantee for the loss suffered
by the OS algorithm, and also shows that it is the game-theoretically optimum strategy
when the number of examples is large. Similar results have been proved by Schapire (2001),
but our theorem holds much more generally, and also achieves tighter lower bounds.

Theorem 9 (Extension of results in (Schapire, 2001)) Suppose the weak-learning
condition is not vacuous and is given by $(\mathcal{C}, B)$, where $\mathcal{C}$ is such that, for some convex cone
$\mathcal{C}_0 \subseteq \mathbb{R}^k$, the condition (17) holds. Let the potential functions $\phi_t^b$ be defined as in (20), and
assume the Booster employs the OS algorithm, choosing $\alpha_t = 1$ and $C_t$ as in (25) in each
round $t$. Then the average potential of the states,

$$\frac{1}{m} \sum_{i=1}^{m} \phi_{T-t}^{B(i)}(s_t(i)),$$

never increases in any round. In particular, the loss suffered after $T$ rounds of play is at
most

$$\frac{1}{m} \sum_{i=1}^{m} \phi_T^{B(i)}(0). \qquad (26)$$

Further, under certain conditions, this bound is nearly tight. In particular, assume the
loss function does not vary too much but satisfies

$$\sup_{s, s' \in S_T} |L(s) - L(s')| \le \varkappa(L, T), \qquad (27)$$

where $S_T$, a subset of $\{s \in \mathbb{R}^k : \|s\|_\infty \le T\}$, is the set of all states reachable in $T$ iterations,
and $\varkappa(L, T)$ is an upper bound on the discrepancy of losses between any two reachable states
when the loss function is $L$ and the total number of iterations is $T$. Then, for any $\epsilon > 0$,
when the number of examples $m$ is sufficiently large,

$$m \ge \frac{T \varkappa(L, T)}{\epsilon}, \qquad (28)$$

no Booster strategy can guarantee to achieve in $T$ rounds a loss that is $\epsilon$ less than the bound
(26).

In order to implement the near optimal OS strategy, we need to solve (25). This is
computationally only as hard as evaluating the potentials, which in turn reduces to computing the
recurrences in (20). In the next few sections, we study how to do this when using various
losses and weak learning conditions.

6. Solving for any fixed edge-over-random condition

In this section we show how to implement the OS strategy when the weak learning
condition is any fixed edge-over-random condition: $(\mathcal{C}^{eor}, B)$ for some $B \in \mathcal{B}^{eor}_\gamma$. By our previous
discussions, this is equivalent to computing the potential $\phi_t^b$ by solving the recurrence in
(20), where the vector $b$ corresponds to some row of the baseline $B$. Let $\Delta_\gamma^k \subseteq \Delta\{1, \ldots, k\}$
denote the set of all edge-over-random distributions on $\{1, \ldots, k\}$ with $\gamma$ more weight on
the first coordinate:

$$\Delta_\gamma^k = \left\{ b \in \Delta\{1, \ldots, k\} : b(1) - \gamma = \max \{b(2), \ldots, b(k)\} \right\}. \qquad (29)$$

Note that $\mathcal{B}^{eor}_\gamma$ consists of all matrices whose rows belong to the set $\Delta_\gamma^k$. Therefore we are
interested in computing $\phi_t^b$, where $b$ is an arbitrary edge-over-random distribution: $b \in \Delta_\gamma^k$.
We begin by simplifying the recurrence (20) satisfied by such potentials, and show how to
compute the optimal cost matrix in terms of the potentials.

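Lemma 10 Assume $L$ is proper, and $b \in \Delta_\gamma^k$ is an edge-over-random distribution. Then
the recurrence (20) may be simplified as

$$\phi_t^b(s) = \mathbb{E}_{l \sim b}\left[ \phi_{t-1}^b(s + e_l) \right]. \qquad (30)$$

Further, if the cost matrix $C_t$ is chosen as follows

$$C_t(i, l) = \phi_{T-t-1}^{B(i)}(s_t(i) + e_l), \qquad (31)$$

then $C_t$ satisfies the condition in (25), and hence is the optimal choice.

Proof Let $\mathcal{C}_0^{eor} \subseteq \mathbb{R}^k$ denote all vectors $c$ satisfying $\forall l : c(1) \le c(l)$. Then, we have

$$\begin{aligned}
\phi_t^b(s) &= \min_{c \in \mathcal{C}_0^{eor}} \; \max_{p \in \Delta\{1, \ldots, k\}} \; \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) \right] \quad \text{s.t. } \mathbb{E}_{l \sim p}[c(l)] \le \mathbb{E}_{l \sim b}[c(l)] && \text{(by (20))} \\
&= \min_{c \in \mathcal{C}_0^{eor}} \; \max_{p \in \Delta} \; \min_{\lambda \ge 0} \left\{ \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) \right] + \lambda \left( \mathbb{E}_{l \sim b}[c(l)] - \mathbb{E}_{l \sim p}[c(l)] \right) \right\} && \text{(Lagrangean)} \\
&= \min_{c \in \mathcal{C}_0^{eor}} \; \min_{\lambda \ge 0} \; \max_{p \in \Delta} \; \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) \right] + \lambda \langle b - p, c \rangle && \text{(Theorem 1)} \\
&= \min_{c \in \mathcal{C}_0^{eor}} \; \max_{p \in \Delta} \; \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) \right] + \langle b - p, c \rangle && \text{(absorb } \lambda \text{ into } c\text{)} \\
&= \max_{p \in \Delta} \; \min_{c \in \mathcal{C}_0^{eor}} \; \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) \right] + \langle b - p, c \rangle && \text{(Theorem 1)}.
\end{aligned}$$

Unless $b(1) - p(1) \le 0$ and $b(l) - p(l) \ge 0$ for each $l > 1$, the quantity $\langle b - p, c \rangle$ can be made
arbitrarily small for appropriate choices of $c \in \mathcal{C}_0^{eor}$. The max-player is therefore forced to
constrain its choices of $p$, and the above expression becomes

$$\max_{p \in \Delta} \; \mathbb{E}_{l \sim p}\left[ \phi_{t-1}^b(s + e_l) \right] \quad \text{s.t. } b(1) - p(1) \le 0, \text{ and } b(l) - p(l) \ge 0 \text{ for each } l > 1.$$

Lemma 6 of (Schapire, 2001) states that if $L$ is proper (as defined here), so is $\phi_t^b$; the same
result can be extended to our drifting games. This implies the optimal choice of $p$ in the
above expression is in fact the distribution that puts as small weight as possible in the first
coordinate, namely $b$. Therefore the optimum choice of $p$ is $b$, and the potential is the
same as in (30).

We end the proof by showing that the choice of cost matrix in (31) is optimum.
Theorem 9 states that a cost matrix $C_t$ is the optimum choice if it satisfies (25), that is, if the
expression

$$\max_{l=1}^{k} \left\{ \phi_{T-t-1}^{B(i)}(s + e_l) - \left( C_t(i, l) - \langle C_t(i), B(i) \rangle \right) \right\} \qquad (32)$$

is equal to

$$\min_{c \in \mathcal{C}_0} \max_{l=1}^{k} \left\{ \phi_{T-t-1}^{B(i)}(s + e_l) - \left( c(l) - \langle c, B(i) \rangle \right) \right\} = \phi_{T-t}^{B(i)}(s), \qquad (33)$$

where the equality in (33) follows from (24). If $C_t(i)$ is chosen as in (31), then, for any
label $l$, the expression within max in (32) evaluates to

$$\begin{aligned}
&\phi_{T-t-1}^{B(i)}(s + e_l) - \left( \phi_{T-t-1}^{B(i)}(s + e_l) - \langle C_t(i), B(i) \rangle \right) \\
&\quad = \langle B(i), C_t(i) \rangle \\
&\quad = \mathbb{E}_{l \sim B(i)}\left[ C_t(i, l) \right] \\
&\quad = \mathbb{E}_{l \sim B(i)}\left[ \phi_{T-t-1}^{B(i)}(s + e_l) \right] \\
&\quad = \phi_{T-t}^{B(i)}(s),
\end{aligned}$$

where the last equality follows from (30). Therefore the max expression in (32) is also equal
to $\phi_{T-t}^{B(i)}(s)$, which is what we needed to show. ■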

Eq. (31) in Lemma 10 implies the cost matrix chosen by the OS strategy can be expressed
in terms of the potentials, which is the only thing left to calculate. Fortunately, the
simplification (30) of the drifting games recurrence allows the potentials to be solved completely
in terms of a random walk $\mathcal{R}_b^t(x)$. This random variable denotes the position of a particle
after $t$ time steps, that starts at location $x \in \mathbb{R}^k$, and in each step moves in direction $e_l$
with probability $b(l)$.

Corollary 11 The recurrence in (30) can be solved as follows:

$$\phi_t^b(s) = \mathbb{E}\left[ L\left( \mathcal{R}_b^t(s) \right) \right]. \qquad (34)$$

Proof Inductively assuming $\phi_{t-1}^b(x) = \mathbb{E}\left[ L\left( \mathcal{R}_b^{t-1}(x) \right) \right]$,

$$\phi_t^b(s) = \mathbb{E}_{l \sim b}\left[ L\left( \mathcal{R}_b^{t-1}(s) + e_l \right) \right] = \mathbb{E}\left[ L\left( \mathcal{R}_b^t(s) \right) \right].$$

The last equality follows by observing that the random position $\mathcal{R}_b^{t-1}(s) + e_l$ is distributed
as $\mathcal{R}_b^t(s)$ when $l$ is sampled from $b$. ■
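The recurrence (30) translates directly into a memoized recursion, which also evaluates the random-walk expectation in (34) exactly for small $t$. The following is a minimal sketch, assuming 0-indexed labels with index 0 as the correct label; the names `make_potential` and `zero_one_loss` and the sample distribution `b` are illustrative choices, not taken from the paper.

```python
from functools import lru_cache
from fractions import Fraction

def zero_one_loss(s):
    """L^err from (19): 1 if the correct label (index 0) does not strictly win."""
    return 1 if s[0] <= max(s[1:]) else 0

def make_potential(b, loss):
    """Return phi(t, s) computed by the recurrence phi_t(s) = E_{l~b}[phi_{t-1}(s + e_l)]."""
    k = len(b)

    @lru_cache(maxsize=None)
    def phi(t, s):
        if t == 0:
            return Fraction(loss(s))          # phi_0 = L
        total = Fraction(0)
        for l in range(k):
            nxt = s[:l] + (s[l] + 1,) + s[l + 1:]   # state after one step in direction e_l
            total += b[l] * phi(t - 1, nxt)
        return total

    return phi

# Hypothetical edge-over-random row: the correct label gets 1/4 more weight than the others.
b = (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4))
phi = make_potential(b, zero_one_loss)
start = (0, 0, 0)
```

Calling `phi(t, start)` gives the probability that a walk of length `t` driven by `b` ends in a state where the correct label fails to win outright. This plain recursion visits a number of states that grows quickly with `t` and `k`, so it is meant for checking small cases rather than as an efficient method.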

Lemma 10 and Corollary 11 together imply:

Theorem 12 Assume $L$ is proper and $b \in \Delta_\gamma^k$ is an edge-over-random distribution. Then
the potential $\phi_t^b$, defined by the recurrence in (20), has the solution given in (34) in terms
of random walks.

Before we can compute (34), we need to choose a loss function $L$. We next consider two
options for the loss: the non-convex 0-1 error, and exponential loss.

Exponential Loss. The exponential loss serves as a smooth convex proxy for the
discontinuous non-convex 0-1 error (19) that we would ultimately like to bound, and is given
by

$$L^{exp}_\eta(s) = \sum_{l=2}^{k} e^{\eta (s_l - s_1)}. \qquad (35)$$

The parameter $\eta$ can be thought of as the weight in each round, that is, $\alpha_t = \eta$ in each
round. Then notice that the weighted state $f_t$ of the examples, defined in (22), is related to
the unweighted states $s_t$ as $f_t(l) = \eta s_t(l)$. Therefore the exponential loss function in (35)
directly measures the loss of the weighted state as

$$L^{exp}_1(f_t) = \sum_{l=2}^{k} e^{f_t(l) - f_t(1)}. \qquad (36)$$

Because of this correspondence, the optimal strategy with the loss function $L^{exp}_1$ and $\alpha_t = \eta$
is the same as that using loss $L^{exp}_\eta$ and $\alpha_t = 1$. We study the latter setting so that we may
use the results derived earlier. With the choice of the exponential loss $L^{exp}_\eta$, the potentials
are easily computed, and in fact have a closed form solution.

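Theorem 13 If $L^{exp}_\eta$ is as in (35), where $\eta$ is non-negative, then the solution in Theorem 12
evaluates to $\phi_t^b(s) = \sum_{l=2}^{k} (a_l)^t e^{\eta (s_l - s_1)}$, where $a_l = 1 - (b_1 + b_l) + e^{\eta} b_l + e^{-\eta} b_1$.

The proof by induction is straightforward. By tuning the weight $\eta$, each $a_l$ can always be
made less than 1. This ensures the exponential loss decays exponentially with rounds. In
particular, when $B = U_\gamma$ (so that the condition is $(\mathcal{C}^{eor}, U_\gamma)$), the relevant potential $\phi_t(s)$
or $\phi_t(f)$ is given by

$$\phi_t(s) = \phi_t(f) = \kappa(\gamma, \eta)^t \sum_{l=2}^{k} e^{\eta (s_l - s_1)} = \kappa(\gamma, \eta)^t \sum_{l=2}^{k} e^{f_l - f_1}, \qquad (37)$$

where

$$\kappa(\gamma, \eta) = 1 + \frac{1 - \gamma}{k} \left( e^{\eta} + e^{-\eta} - 2 \right) - \left( 1 - e^{-\eta} \right) \gamma. \qquad (38)$$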
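For readers who want the omitted induction step spelled out, one way to write it is sketched below. It uses only the recurrence (30), the loss (35) and the definition of $U_\gamma$; the intermediate grouping of terms is a presentation choice made here rather than the authors' own derivation.

```latex
% Induction step for Theorem 13, assuming
% \phi_{t-1}^b(s) = \sum_{l=2}^{k} a_l^{\,t-1} e^{\eta(s_l - s_1)}.
\begin{align*}
\phi_t^b(s)
  &= \sum_{j=1}^{k} b_j \, \phi_{t-1}^b(s + e_j)
     && \text{by (30)} \\
  &= \sum_{l=2}^{k} a_l^{\,t-1} e^{\eta(s_l - s_1)}
     \Bigl[ b_1 e^{-\eta} + b_l e^{\eta} + (1 - b_1 - b_l) \Bigr]
     && \text{a step along } e_1,\ e_l,\ \text{or any other direction} \\
  &= \sum_{l=2}^{k} a_l^{\,t} e^{\eta(s_l - s_1)} .
\end{align*}
% Specializing to a row of U_\gamma, where b_1 = (1-\gamma)/k + \gamma
% and b_l = (1-\gamma)/k for l > 1:
\begin{align*}
a_l &= 1 + b_l \,(e^{\eta} - 1) + b_1 \,(e^{-\eta} - 1) \\
    &= 1 + \frac{1-\gamma}{k}\bigl(e^{\eta} + e^{-\eta} - 2\bigr)
         - \bigl(1 - e^{-\eta}\bigr)\gamma
     \;=\; \kappa(\gamma, \eta).
\end{align*}
```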

The cost-matrix output by the OS algorithm can be simplified by rescaling, or adding the
same number to each coordinate of a cost vector, without affecting the constraints it imposes
on a weak classifier, to the following form

$$C(i, l) = \begin{cases} (e^{\eta} - 1)\, e^{\eta (s_l - s_1)} & \text{if } l > 1, \\ (e^{-\eta} - 1) \sum_{j=2}^{k} e^{\eta (s_j - s_1)} & \text{if } l = 1. \end{cases}$$
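The simplified cost vector for a single example can be computed in a few lines. The sketch below is an illustration under the assumption of 0-indexed labels with index 0 as the correct label; `exp_cost_row` and `exp_loss` are hypothetical helper names, and the sample state and value of `eta` are arbitrary.

```python
import math

def exp_loss(s, eta):
    """Exponential loss (35) of a state s, with s[0] the votes for the correct label."""
    return sum(math.exp(eta * (s[l] - s[0])) for l in range(1, len(s)))

def exp_cost_row(s, eta):
    """Rescaled OS cost vector for one example under exponential loss."""
    weights = [math.exp(eta * (s[l] - s[0])) for l in range(1, len(s))]
    row = [(math.exp(-eta) - 1.0) * sum(weights)]          # correct label: non-positive cost
    row += [(math.exp(eta) - 1.0) * w for w in weights]    # incorrect labels: non-negative cost
    return row

# Hypothetical state: 3 votes for the correct label, 1 and 2 votes for the others.
s = [3, 1, 2]
eta = 0.4
row = exp_cost_row(s, eta)
```

Each entry `row[l]` equals the change in exponential loss caused by one more vote for label `l`, which is what remains of the potential $\phi(s + e_l)$ after subtracting the common term and dropping the common positive factor.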



Using the correspondence between unweighted and weighted states, the above may also be
rewritten as:

$$C(i, l) = \begin{cases} (e^{\eta} - 1)\, e^{f_l - f_1} & \text{if } l > 1, \\ (e^{-\eta} - 1) \sum_{j=2}^{k} e^{f_j - f_1} & \text{if } l = 1. \end{cases} \qquad (39)$$

With such a choice, Theorem 9 and the form of the potential guarantee that the average
loss

$$\frac{1}{m} \sum_{i=1}^{m} L^{exp}_\eta(s_t(i)) = \frac{1}{m} \sum_{i=1}^{m} L^{exp}_1(f_t(i)) \qquad (40)$$

of the states changes by a factor of at most $\kappa(\gamma, \eta)$ every round. Therefore the final loss,
which upper bounds the error, i.e., the fraction of misclassified training examples, is at most
$(k - 1)\kappa(\gamma, \eta)^T$. Since this upper bound holds for any value of $\eta$, we may tune it to optimize
the bound. Setting $\eta = \ln(1 + \gamma)$, the error can be upper bounded by $(k - 1)e^{-T\gamma^2/2}$.
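The effect of the choice $\eta = \ln(1 + \gamma)$ can be checked numerically. The following sketch evaluates $\kappa(\gamma, \eta)$ from (38) and the resulting bound; the function names and the sample values of $\gamma$, $k$ and $T$ are assumptions made for illustration.

```python
import math

def kappa(gamma, eta, k):
    """Per-round factor from (38) for the baseline U_gamma."""
    return (1.0
            + (1.0 - gamma) / k * (math.exp(eta) + math.exp(-eta) - 2.0)
            - (1.0 - math.exp(-eta)) * gamma)

def exp_loss_error_bound(gamma, k, T):
    """(k - 1) * kappa^T with the choice eta = ln(1 + gamma)."""
    eta = math.log1p(gamma)
    return (k - 1) * kappa(gamma, eta, k) ** T

# Hypothetical setting: edge 0.1, six labels, 500 rounds.
gamma, k, T = 0.1, 6, 500
bound = exp_loss_error_bound(gamma, k, T)
closed_form = (k - 1) * math.exp(-T * gamma ** 2 / 2)
```

For this choice of $\eta$ one can verify that $\kappa \le 1 - \gamma^2/2 \le e^{-\gamma^2/2}$ whenever $k \ge 2$, so `bound` never exceeds `closed_form`.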

Zero-one Loss. There is no simple closed form solution for the potential when using the
zero-one loss $L^{err}$ (19). However, we may compute the potentials efficiently as follows. To
compute $\phi_t^b(s)$, we need to find the probability that a random walk (making steps according
to $b$) of length $t$ in $\mathbb{Z}^k$, starting at $s$, will end up in a region where the loss function is 1. Any
such random walk will consist of $x_l$ steps in direction $e_l$, where the non-negative $x_l$ satisfy
$\sum_l x_l = t$. The probability of each such path is $\prod_l b_l^{x_l}$. Further, there are exactly
$\binom{t}{x_1, \ldots, x_k}$ such paths. Starting at state $s$, such a path will lead to a correct answer only if
$s_1 + x_1 > s_l + x_l$ for each $l > 1$. Hence we may write the potential $\phi_t^b(s)$ as

$$\begin{aligned}
\phi_t^b(s) = 1 - \sum_{x_1, \ldots, x_k} & \binom{t}{x_1, \ldots, x_k} \prod_{l=1}^{k} b_l^{x_l} \\
\text{s.t. } & x_1 + \ldots + x_k = t, \\
& \forall l : x_l \ge 0, \\
& \forall l > 1 : x_l + s_l < x_1 + s_1.
\end{aligned}$$

Since the $x_l$'s are restricted to be integers, this problem is presumably hard. In particular,
the only algorithms known to the authors that take time logarithmic in $t$ are also exponential
in $k$. However, by using dynamic programming, we can compute the summation in time
polynomial in $|s_l|$, $t$ and $k$. In fact, the runtime is always $O(t^3 k)$, and at least $\Omega(tk)$.
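One possible dynamic program for this summation is sketched below. It is not necessarily the authors' procedure: it is a straightforward scheme that fixes the number of steps towards the correct label and then convolves over the remaining labels, with labels 0-indexed (index 0 is the correct label), exact rational arithmetic, and hypothetical names throughout.

```python
from fractions import Fraction
from math import factorial

def zero_one_potential(b, s, t):
    """phi_t^b(s) for 0-1 loss: probability that label 0 does not strictly win after t steps."""
    k = len(b)
    # term[l][x] = b_l^x / x!, so a product over labels times t! is a multinomial probability.
    term = [[Fraction(b[l]) ** x / factorial(x) for x in range(t + 1)] for l in range(k)]
    p_correct = Fraction(0)
    for x0 in range(t + 1):                       # steps taken towards the correct label
        # ways[n]: accumulated weight of giving n steps to the incorrect labels handled so far
        ways = [Fraction(0)] * (t - x0 + 1)
        ways[0] = Fraction(1)
        for l in range(1, k):
            cap = s[0] + x0 - s[l] - 1            # largest x_l keeping label l strictly behind
            new = [Fraction(0)] * (t - x0 + 1)
            for n, weight in enumerate(ways):
                if weight == 0:
                    continue
                for x in range(0, min(cap, t - x0 - n) + 1):
                    new[n + x] += weight * term[l][x]
            ways = new
        p_correct += term[0][x0] * ways[t - x0]
    return 1 - factorial(t) * p_correct

# Hypothetical usage: gamma-biased uniform distribution with k = 6 and gamma = 1/10.
k, gamma, T = 6, Fraction(1, 10), 10
b = [(1 - gamma) / k + gamma] + [(1 - gamma) / k] * (k - 1)
error_bound = zero_one_potential(b, [0] * k, T)
```

The three nested loops over `x0`, `n` and `x` each run at most `t + 1` times for every one of the `k - 1` incorrect labels, so this sketch performs on the order of $t^3 k$ arithmetic operations.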

The bounds on error we achieve, although not in closed form, are much tighter than
those obtainable using exponential loss. The exponential loss analysis yields an error upper
bound of $(k - 1)e^{-T\gamma^2/2}$. Using a different initial distribution, Schapire and Singer (1999)
achieve the slightly better bound $\sqrt{k - 1}\, e^{-T\gamma^2/2}$. However, when the edge $\gamma$ is small and
the number of rounds is small, each bound is greater than 1 and hence trivial. On the other
hand, the bounds computed by the above dynamic program are sensible for all values of $k$, $\gamma$
and $T$. When $b$ is the $\gamma$-biased uniform distribution $b = \left( \frac{1-\gamma}{k} + \gamma, \frac{1-\gamma}{k}, \ldots, \frac{1-\gamma}{k} \right)$, a table
containing the error upper bound $\phi_T^b(0)$ for $k = 6$, $\gamma = 0$ and small values for the number of
rounds $T$ is shown in Figure 2(a); note that with the exponential loss, the bound is always
1 if the edge $\gamma$ is 0. Further, the bounds due to the exponential loss analyses seem to imply
that the dependence of the error on the number of labels is monotonic. However, a plot of
the tighter bounds with edge $\gamma = 0.1$, number of rounds $T = 10$ against various values of $k$,
shown in Figure 2(b), indicates that the true dependence is more complicated. Therefore
the tighter analysis also provides qualitative insights not obtainable via the exponential loss
bound.

Figure 2: Plot of potential value $\phi_T^b(0)$ where $b$ is the $\gamma$-biased uniform distribution:
$b = \left( \frac{1-\gamma}{k} + \gamma, \frac{1-\gamma}{k}, \ldots, \frac{1-\gamma}{k} \right)$. (a): Potential values (rounded to two decimal places) for different number
of rounds $T$ using $\gamma = 0$ and $k = 6$. These are bounds on the error, and less than 1 even when the
edge and number of rounds are small. (b): Potential values for different number of classes $k$, with
$\gamma = 0.1$, and $T = 10$. These are tight estimates for the optimal error, and yet not monotonic in the
number of classes.



7. Solving for the minimal weak learning condition 

In the previous section we saw how to find the optimal boosting strategy when using any
fixed edge-over-random condition. However, as we have seen before, these conditions can
be stronger than necessary, and therefore lead to boosting algorithms that require
additional assumptions. Here we show how to compute the optimal algorithm while using the
weakest weak learning condition, provided by (16), or equivalently the condition used by
AdaBoost.MR, $(\mathcal{C}^{MR}, B^{MR}_\gamma)$. Since there are two possible formulations for the minimal
condition, it is not immediately clear which to use to compute the optimal boosting strategy.
To resolve this, we first show that the optimal boosting strategy based on any formulation
of a necessary and sufficient weak learning condition is the same. Having resolved this
ambiguity, we show how to compute this strategy for the exponential loss and 0-1 error using
the first formulation.

7.1 Game-theoretic equivalence of necessary and sufficient weak-learning
conditions

In this section we study the effect of the weak learning condition on the game-theoretically 
optimal boosting strategy. We introduce the notion of game-theoretic equivalence between 
two weak learning conditions, that determines if the payoffs (18) of the optimal boosting 
strategies based on the two conditions are identical. This is different from the usual notion 
of equivalence between two conditions, which holds if any weak classifier space satisfies 
both conditions or neither condition. In fact we prove that game-theoretic equivalence is a
broader notion; in other words, equivalence implies game-theoretic equivalence. A special
case of this general result is that any two weak learning conditions that are necessary 
and sufficient, and hence equivalent to boostability, are also game-theoretically equivalent. 
In particular, so are the conditions of AdaBoost.MR and (16), and the resulting optimal 
Booster strategies enjoy equally good payoffs. We conclude that in order to derive the 
optimal boosting strategy that uses the minimal weak-learning condition, it is sound to use 
either of these two formulations. 

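The purpose of a weak learning condition $(\mathcal{C}, \mathbf{B})$ is to impose restrictions on the
Weak-Learner's responses in each round. These restrictions are captured by subsets of the weak
classifier space as follows. If Booster chooses cost-matrix $\mathbf{C} \in \mathcal{C}$ in a round, the
Weak-Learner's response $h$ is restricted to the subset $S_{\mathbf{C}} \subseteq \mathcal{H}^{\mathrm{all}}$ defined as

$$S_{\mathbf{C}} = \left\{ h \in \mathcal{H}^{\mathrm{all}} : \mathbf{C} \bullet \mathbf{1}_h \le \mathbf{C} \bullet \mathbf{B} \right\}.$$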

Thus, a weak learning condition is essentially a family of subsets of the weak classifier
space. Further, smaller subsets mean fewer options for Weak-Learner, and hence better
payoffs for the optimal boosting strategy. Based on this idea, we say that a weak
learning condition $(\mathcal{C}_1, \mathbf{B}_1)$ is game-theoretically stronger than another condition $(\mathcal{C}_2, \mathbf{B}_2)$ if
the following holds: for every subset $S_{\mathbf{C}_2}$ in the second condition (that is, $\mathbf{C}_2 \in \mathcal{C}_2$), there is
a subset $S_{\mathbf{C}_1}$ in the first condition (that is, $\mathbf{C}_1 \in \mathcal{C}_1$), such that $S_{\mathbf{C}_1} \subseteq S_{\mathbf{C}_2}$. Mathematically,
this may be written as follows:

$$\forall \mathbf{C}_2 \in \mathcal{C}_2,\ \exists \mathbf{C}_1 \in \mathcal{C}_1 : S_{\mathbf{C}_1} \subseteq S_{\mathbf{C}_2}.$$

Intuitively, a game-theoretically stronger condition will allow Booster to place similar or
stricter restrictions on Weak-Learner in each round. Therefore, the optimal Booster payoff
using a game-theoretically stronger condition is at least equally good, if not better. It
therefore follows that if two conditions are both game-theoretically stronger than each other,
the corresponding Booster payoffs must be equal, that is they must be game-theoretically
equivalent.

Note that game-theoretic equivalence of two conditions does not mean that they are
identical as families of subsets, for we may arbitrarily add large and "useless" subsets to
the two conditions without affecting the Booster payoffs, since these subsets will never be
used by an optimal Booster strategy. In fact we next show that game-theoretic equivalence
is a broader notion than just equivalence.

Theorem 14 Suppose $(\mathcal{C}_1, \mathbf{B}_1)$ and $(\mathcal{C}_2, \mathbf{B}_2)$ are two equivalent weak learning conditions,
that is, every space $\mathcal{H}$ satisfies both or neither condition. Then each condition is
game-theoretically stronger than the other, and hence game-theoretically equivalent.

Proof We argue by contradiction. Assume that despite equivalence, the first condition
(without loss of generality) includes a particularly hard subset $S_{\mathbf{C}_1} \subseteq \mathcal{H}^{\mathrm{all}}$ for $\mathbf{C}_1 \in \mathcal{C}_1$ which
does not contain any subset in the second condition. In particular, every subset
$S_{\mathbf{C}_2}$, $\mathbf{C}_2 \in \mathcal{C}_2$, in the second condition is satisfied by some weak classifier $h_{\mathbf{C}_2}$ not satisfying
the hard subset in the first condition: $h_{\mathbf{C}_2} \in S_{\mathbf{C}_2} \setminus S_{\mathbf{C}_1}$. Therefore, the space

$$\mathcal{H} = \left\{ h_{\mathbf{C}_2} : \mathbf{C}_2 \in \mathcal{C}_2 \right\}$$

formed by just these classifiers satisfies the second condition, but has an empty intersection
with $S_{\mathbf{C}_1}$. In other words, $\mathcal{H}$ satisfies the second but not the first condition, a contradiction
to their equivalence. ■
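
The definitions above can be checked by brute force on a toy example. The sketch below is only an illustration under stated assumptions: a weak learning condition is modelled directly as a finite family of subsets of a four-element classifier space, and the names `satisfies`, `gt_stronger` and `equivalent` are hypothetical helpers, not notation from the paper.

```python
from itertools import product

def satisfies(space, condition):
    """A space satisfies a condition if it meets every subset S_C in the family."""
    return all(space & subset for subset in condition)

def gt_stronger(cond1, cond2):
    """cond1 is game-theoretically stronger than cond2:
    every subset of cond2 contains some subset of cond1."""
    return all(any(s1 <= s2 for s1 in cond1) for s2 in cond2)

def equivalent(cond1, cond2, universe):
    """Brute-force check that every space satisfies both conditions or neither."""
    items = sorted(universe)
    for mask in product([False, True], repeat=len(items)):
        space = frozenset(h for h, keep in zip(items, mask) if keep)
        if satisfies(space, cond1) != satisfies(space, cond2):
            return False
    return True

universe = frozenset(range(4))                 # four hypothetical weak classifiers
cond_a = [frozenset({0, 1}), frozenset({2})]
cond_b = cond_a + [frozenset({0, 1, 3})]       # adds a large, "useless" subset
cond_c = [frozenset({0}), frozenset({2})]      # strictly tighter first subset

# Equivalent conditions: each is game-theoretically stronger than the other.
print(equivalent(cond_a, cond_b, universe),
      gt_stronger(cond_a, cond_b), gt_stronger(cond_b, cond_a))
# Non-equivalent conditions: strength holds in one direction only.
print(equivalent(cond_a, cond_c, universe),
      gt_stronger(cond_c, cond_a), gt_stronger(cond_a, cond_c))
```

The pair `cond_a`, `cond_b` also illustrates the earlier remark that game-theoretically equivalent conditions need not be identical as families of subsets.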

An immediate corollary is the game-theoretic equivalence of necessary and sufficient
conditions.

Corollary 15 Any two necessary and sufficient weak learning conditions are game-theoretically
equivalent. In particular the optimum Booster strategies based on AdaBoost.MR's condition
$(\mathcal{C}^{\mathrm{MR}}, \mathbf{B}^{\mathrm{MR}}_\gamma)$ and (16) have equal payoffs.

Therefore, in deriving the optimal Booster strategy, it is sound to work with either
AdaBoost.MR's condition or (16). In the next section, we actually compute the optimal
strategy using the latter formulation.

7.2 Optimal strategy with the minimal conditions 

In this section we compute the optimal Booster strategy that uses the minimum weak 
learning condition provided in (16). We choose this instead of AdaBoost.MR's condition 
because this description is more closely related to the edge-over-random conditions, and 
the resulting algorithm has a close relationship to the ones derived for fixed edge-over-random
conditions, and is therefore more insightful. However, this formulation does not state
the condition as a single pair (C,B), and therefore we cannot directly use the result of 
Theorem 9. Instead, we define new potentials and a modified OS strategy that is still 
nearly optimal, and this constitutes the first part of this section. In the second part, we 
show how to compute these new potentials and the resulting OS strategy. 

7.2.1 Modified potentials and OS strategy 

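The condition in (16) is not stated as a single pair $(\mathcal{C}^{\mathrm{eor}}, \mathbf{B})$, but uses all possible
edge-over-random baselines $\mathbf{B} \in \mathcal{B}^{\mathrm{eor}}_\gamma$. Therefore, we modify the definitions (20) of the potentials
accordingly to extract an optimal Booster strategy. Recall that $\Delta^k_\gamma$ is defined in (29) as
the set of all edge-over-random distributions which constitute the rows of edge-over-random
baselines $\mathbf{B} \in \mathcal{B}^{\mathrm{eor}}_\gamma$. Using these, define new potentials $\phi_t(\mathbf{s})$ as follows:

$$\phi_t(\mathbf{s}) = \min_{\mathbf{c} \in \mathcal{C}^{\mathrm{eor}}_0} \max_{\mathbf{b} \in \Delta^k_\gamma} \max_{\mathbf{p} \in \Delta\{1,\ldots,k\}} \mathbb{E}_{l \sim \mathbf{p}}\left[\phi_{t-1}(\mathbf{s} + \mathbf{e}_l)\right] \quad \text{s.t.} \quad \mathbb{E}_{l \sim \mathbf{p}}[c(l)] \le \langle \mathbf{b}, \mathbf{c} \rangle. \tag{41}$$

The main difference between (41) and (20) is that while the older potentials were defined
using a fixed vector $\mathbf{b}$ corresponding to some row in the fixed baseline $\mathbf{B}$, the new definition
takes the maximum over all possible rows $\mathbf{b} \in \Delta^k_\gamma$ that an edge-over-random baseline
$\mathbf{B} \in \mathcal{B}^{\mathrm{eor}}_\gamma$ may have. As before, we may write the recurrence in (41) in its dual form

$$\phi_t(\mathbf{s}) = \min_{\mathbf{c} \in \mathcal{C}^{\mathrm{eor}}_0} \max_{\mathbf{b} \in \Delta^k_\gamma} \max_{l=1}^{k} \left\{ \phi_{t-1}(\mathbf{s} + \mathbf{e}_l) - \left(c(l) - \langle \mathbf{c}, \mathbf{b} \rangle\right) \right\}. \tag{42}$$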

The proof is very similar to that of Lemma 8 and is omitted. We may now define a new OS
strategy that chooses a cost-matrix in round $t$ analogously:

$$\mathbf{C}_t(i) \in \operatorname*{argmin}_{\mathbf{c} \in \mathcal{C}^{\mathrm{eor}}_0} \max_{\mathbf{b} \in \Delta^k_\gamma} \max_{l=1}^{k} \left\{ \phi_{t-1}(\mathbf{s} + \mathbf{e}_l) - \left(c(l) - \langle \mathbf{c}, \mathbf{b} \rangle\right) \right\}, \tag{43}$$

where recall that $\mathbf{s} = \mathbf{s}_t(i)$ denotes the state vector (defined in (21)) of example $i$. With this
strategy, we can show an optimality result very similar to Theorem 9.

Theorem 16 Suppose the weak-learning condition is given by (16). Let the potential
functions be defined as in (41), and assume the Booster employs the modified OS strategy,
choosing $\alpha_t = 1$ and $\mathbf{C}_t$ as in (43) in each round $t$. Then the average potential of the states,

$$\frac{1}{m} \sum_{i=1}^{m} \phi_{T-t}(\mathbf{s}_t(i)),$$

never increases in any round. In particular, the loss suffered after $T$ rounds of play is at
most $\phi_T(\mathbf{0})$.

Further, for any $\varepsilon > 0$, when the loss function satisfies (27) and the number of examples
$m$ is as large as in (28), no Booster strategy can guarantee to achieve less than $\phi_T(\mathbf{0}) - \varepsilon$
loss in $T$ rounds.

The proof is very similar to that of Theorem 9 and is omitted.

7.2.2 Computing the new potentials

Here we show how to compute the new potentials. The resulting algorithms will require 
exponential time, and we provide some empirical evidence showing that this might be 
necessary. Finally, we show how to carry out the computations efficiently in certain special 
situations. 

An exponential time algorithm. Here we show how the potentials may be computed 
as the expected loss of some random walk, just as we did for the potentials arising with fixed 
edge-over-random conditions. The main difference is there will be several random walks to 
choose from. 

We first begin by simplifying the recurrence (41), and expressing the optimal cost matrix 
in (43) in terms of the potentials, just as we did in Lemma 10 for the case of fixed edge- 
over-random conditions. 

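Lemma 17 Assume $L$ is proper. Then the recurrence (41) may be simplified as

$$\phi_t(\mathbf{s}) = \max_{\mathbf{b} \in \Delta^k_\gamma} \mathbb{E}_{l \sim \mathbf{b}}\left[\phi_{t-1}(\mathbf{s} + \mathbf{e}_l)\right]. \tag{44}$$

Further, if the cost matrix $\mathbf{C}_t$ is chosen as follows:

$$\mathbf{C}_t(i, l) = \phi_{T-t-1}(\mathbf{s}_t(i) + \mathbf{e}_l), \tag{45}$$

then $\mathbf{C}_t$ satisfies the condition in (43).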

The proof is very similar to that of Lemma 10 and is omitted. Eq. (45) implies that, as 
before, computing the optimal Booster strategy reduces to computing the new potentials. 
One computational difficulty created by the new definitions (41) or (44) is that they require
infinitely many possible distributions $\mathbf{b} \in \Delta^k_\gamma$ to be considered. We show that we may in
fact restrict our attention to only finitely many such distributions, described next.






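At any state $\mathbf{s}$ and number of remaining iterations $t$, let $\pi$ be a permutation of the
coordinates $\{2, \ldots, k\}$ that sorts the potential values:

$$\phi_{t-1}(\mathbf{s} + \mathbf{e}_{\pi(k)}) \ge \phi_{t-1}(\mathbf{s} + \mathbf{e}_{\pi(k-1)}) \ge \ldots \ge \phi_{t-1}(\mathbf{s} + \mathbf{e}_{\pi(2)}). \tag{46}$$

For any permutation $\pi$ of the coordinates $\{2, \ldots, k\}$, let $\mathbf{b}^\pi_a$ denote the $\gamma$-biased uniform
distribution on the $a$ coordinates $\{1, \pi_k, \pi_{k-1}, \ldots, \pi_{k-a+2}\}$:

$$b^\pi_a(l) = \begin{cases} \frac{1-\gamma}{a} + \gamma & \text{if } l = 1, \\ \frac{1-\gamma}{a} & \text{if } l \in \{\pi_k, \ldots, \pi_{k-a+2}\}, \\ 0 & \text{otherwise.} \end{cases} \tag{47}$$

Then, the next lemma shows that we may restrict our attention to only the distributions
$\{\mathbf{b}^\pi_2, \ldots, \mathbf{b}^\pi_k\}$ when evaluating the recurrence in (44).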
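
As a concrete reading of (46) and (47), the following minimal sketch builds the $\gamma$-biased uniform distribution of a given degree $a$ from the next-step potential values. The function name `biased_uniform` and the list-based representation (index 0 standing for label 1) are assumptions made for illustration only.

```python
def biased_uniform(potentials, a, gamma):
    """Return the gamma-biased uniform distribution of degree a as a list.

    potentials[l] is the next-step potential for a vote on label l+1
    (index 0 is label 1). The support is label 1 plus the a-1 labels among
    2..k with the largest potentials, as prescribed by the sorting in (46).
    """
    k = len(potentials)
    if not 2 <= a <= k:
        raise ValueError("degree a must lie in {2, ..., k}")
    # Labels 2..k (indices 1..k-1), sorted by decreasing potential.
    order = sorted(range(1, k), key=lambda l: potentials[l], reverse=True)
    support = [0] + order[: a - 1]
    b = [0.0] * k
    for l in support:
        b[l] = (1.0 - gamma) / a
    b[0] += gamma  # the extra gamma of mass goes on label 1
    return b

b = biased_uniform([0.2, 0.9, 0.5, 0.7], a=3, gamma=0.1)
print(b)
```

In this example the support is label 1 together with the two labels whose potentials are 0.9 and 0.7, and the label with potential 0.5 receives zero mass.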

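Lemma 18 Fix a state $\mathbf{s}$ and remaining rounds of boosting $t$. Let $\pi$ be a permutation of
the coordinates $\{2, \ldots, k\}$ satisfying (46), and define $\mathbf{b}^\pi_a$ as in (47). Then the recurrence
(44) may be simplified as follows:

$$\phi_t(\mathbf{s}) = \max_{\mathbf{b} \in \Delta^k_\gamma} \mathbb{E}_{l \sim \mathbf{b}}\left[\phi_{t-1}(\mathbf{s} + \mathbf{e}_l)\right] = \max_{2 \le a \le k} \mathbb{E}_{l \sim \mathbf{b}^\pi_a}\left[\phi_{t-1}(\mathbf{s} + \mathbf{e}_l)\right]. \tag{48}$$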

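Proof Assume (by relabeling the coordinates if necessary) that $\pi$ is the identity
permutation, that is, $\pi(2) = 2, \ldots, \pi(k) = k$. Observe that the right hand side of (44) is at
least as much as the right hand side of (48) since the former considers more distributions. We
complete the proof by showing that the former is also at most the latter.
By (44), we may assume that some optimal $\mathbf{b}$ satisfies

$$b(k) = \cdots = b(k-a+2) = b(1) - \gamma,$$
$$b(k-a+1) \le b(1) - \gamma,$$
$$b(k-a) = \cdots = b(2) = 0.$$

Therefore, $\mathbf{b}$ is a distribution supported on $a + 1$ elements, with the minimum weight placed
on element $k - a + 1$. Since the weights sum to one, this implies $b(k-a+1) \in [0, (1-\gamma)/(a+1)]$.
Now, $\mathbb{E}_{l \sim \mathbf{b}}\left[\phi_{t-1}(\mathbf{s} + \mathbf{e}_l)\right]$ may be written as

$$\gamma \cdot \phi_{t-1}(\mathbf{s} + \mathbf{e}_1) + b(k-a+1)\, \phi_{t-1}(\mathbf{s} + \mathbf{e}_{k-a+1}) + \left(1 - \gamma - b(k-a+1)\right) \cdot \frac{\phi_{t-1}(\mathbf{s} + \mathbf{e}_1) + \phi_{t-1}(\mathbf{s} + \mathbf{e}_{k-a+2}) + \ldots + \phi_{t-1}(\mathbf{s} + \mathbf{e}_k)}{a}$$

$$= \gamma \cdot \phi_{t-1}(\mathbf{s} + \mathbf{e}_1) + (1-\gamma) \left\{ \frac{b(k-a+1)}{1-\gamma}\, \phi_{t-1}(\mathbf{s} + \mathbf{e}_{k-a+1}) + \left(1 - \frac{b(k-a+1)}{1-\gamma}\right) \frac{\phi_{t-1}(\mathbf{s} + \mathbf{e}_1) + \phi_{t-1}(\mathbf{s} + \mathbf{e}_{k-a+2}) + \ldots + \phi_{t-1}(\mathbf{s} + \mathbf{e}_k)}{a} \right\}.$$

Replacing $b(k-a+1)/(1-\gamma)$ by $x$ in the above expression, we get a linear function of $x$. When
restricted to $[0, 1/(a+1)]$ the maximum value is attained at a boundary point. For $x = 0$,
the expression becomes

$$\gamma \cdot \phi_{t-1}(\mathbf{s} + \mathbf{e}_1) + (1-\gamma) \cdot \frac{\phi_{t-1}(\mathbf{s} + \mathbf{e}_1) + \phi_{t-1}(\mathbf{s} + \mathbf{e}_{k-a+2}) + \ldots + \phi_{t-1}(\mathbf{s} + \mathbf{e}_k)}{a},$$

which is the expectation under $\mathbf{b}^\pi_a$. For $x = 1/(a+1)$, the expression becomes

$$\gamma \cdot \phi_{t-1}(\mathbf{s} + \mathbf{e}_1) + (1-\gamma) \cdot \frac{\phi_{t-1}(\mathbf{s} + \mathbf{e}_1) + \phi_{t-1}(\mathbf{s} + \mathbf{e}_{k-a+1}) + \ldots + \phi_{t-1}(\mathbf{s} + \mathbf{e}_k)}{a+1},$$

which is the expectation under $\mathbf{b}^\pi_{a+1}$. Since $b(k-a+1)/(1-\gamma)$ lies in $[0, 1/(a+1)]$, the optimal
value is at most the maximum of the two. However each of these last two expressions is at most
the right hand side of (48), completing the proof. ■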
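
The recurrence (48) translates directly into a memoized recursion over states. The sketch below is a small illustration, not the authors' implementation: the 0-1 loss convention in `zero_one` (label 1 is correct and ties count as errors) and the names `make_potential` and `solve` are assumptions introduced here.

```python
from functools import lru_cache

def zero_one(s):
    # Assumed convention: label 1 (index 0) is correct; ties count as errors.
    return 1.0 if max(s[1:]) >= s[0] else 0.0

def make_potential(k, gamma, loss=zero_one):
    """Build a solver for the recurrence (48) with k classes and edge gamma."""

    @lru_cache(maxsize=None)
    def solve(t, s):
        """Return (potential at state s with t rounds left, optimal degree a).

        The degree is None when t == 0. On ties the smallest degree is kept.
        """
        if t == 0:
            return loss(s), None
        # Potentials of the k states reachable by one coordinate step.
        nxt = [solve(t - 1, s[:l] + (s[l] + 1,) + s[l + 1:])[0] for l in range(k)]
        rest = sorted(nxt[1:], reverse=True)   # labels 2..k, largest first
        best_value, best_a = float("-inf"), None
        running = nxt[0]
        for a in range(2, k + 1):
            running += rest[a - 2]             # label 1 plus the top a-1 others
            value = gamma * nxt[0] + (1.0 - gamma) * running / a
            if value > best_value:
                best_value, best_a = value, a
        return best_value, best_a

    return solve

solve = make_potential(k=3, gamma=0.1)
value, degree = solve(1, (0, 0, 0))
print(value, degree)
```

Each call touches every state reachable from `s`, so the cost of this approach grows quickly with the number of classes and rounds; it is meant for small instances such as the three-class examples discussed in this section.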

Unraveling (48), we find that $\phi_t(\mathbf{s})$ is the expected loss of the final state reached by some
random walk of $t$ steps starting at state $\mathbf{s}$. However, the number of possibilities for the
random walk is huge; indeed, the distribution at each step can be any of the $k-1$ possibilities
$\mathbf{b}^\pi_a$ for $a \in \{2, \ldots, k\}$, where the parameter $a$ denotes the size of the support of the
$\gamma$-biased uniform distribution chosen at each step. In other words, at a given state $\mathbf{s}$ with
$t$ rounds of boosting remaining, the parameter $a$ determines the number of directions the
optimal random walk will consider taking; we will therefore refer to $a$ as the degree of the
random walk given $(\mathbf{s}, t)$. Now, the total number of states reachable in $T$ steps is $O(T^{k-1})$.
If the degree assignment for every such state, for every value of $t \le T$, is fixed in advance,
$\mathbf{a} = \{a(\mathbf{s}, t) : t \le T, \mathbf{s} \text{ reachable}\}$, we may identify a unique random walk $\mathcal{R}^{\mathbf{a},t}(\mathbf{s})$ of length $t$
starting at state $\mathbf{s}$. Therefore the potential may be computed as

$$\phi_t(\mathbf{s}) = \max_{\mathbf{a}} \mathbb{E}\left[\mathcal{R}^{\mathbf{a},t}(\mathbf{s})\right], \tag{49}$$

where the expectation is that of the loss of the final state reached by the walk.

A dynamic programming approach for computing (49) requires time and memory linear in
the number of different states reachable by a random walk that takes $T$ coordinate steps:
$O(T^{k-1})$. This is exponential in the dataset size, and hence impractical. In the next two
sections we show that perhaps there may not be any way of computing these efficiently in
general, but provide efficient algorithms in certain special cases.
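
To make the $O(T^{k-1})$ state count concrete, the following sketch counts the states reached after exactly $T$ coordinate steps from the origin, both in closed form and by enumeration. The function names are hypothetical, and counting states at exactly $T$ steps (rather than at most $T$) is an assumption made to keep the example small.

```python
from math import comb

def reachable_states(k, T):
    """Closed form: non-negative integer vectors of length k summing to T."""
    return comb(T + k - 1, k - 1)

def reachable_states_brute(k, T):
    """Enumerate states by taking T coordinate steps from the origin."""
    states = {(0,) * k}
    for _ in range(T):
        states = {s[:l] + (s[l] + 1,) + s[l + 1:] for s in states for l in range(k)}
    return len(states)

for k in (3, 4, 6):
    print(k, reachable_states(k, 10))
```

For fixed $k$ the closed form is a polynomial of degree $k-1$ in $T$, which is the growth rate quoted above.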

Hardness of evaluating the potentials. Here we provide empirical evidence for the 
hardness of computing the new potentials. We first identify a computationally easier
problem, and show that even that is probably hard to compute. Eq. (48) implies that if the
potentials were efficiently computable, the correct value of the degree $a$ could be determined
efficiently. The problem of determining the degree a given the state s and remaining rounds 
t is therefore easier than evaluating the potentials. However, a plot of the degrees against 
states and remaining rounds, henceforth called a degree map, reveals very little structure 
that might be captured by a computationally efficient function. 

We include three such degree maps in Figure 3. Only three classes ($k = 3$) are used, and
the loss function is 0-1 error. We also fix the number $T$ of remaining rounds of boosting
and the value of the edge $\gamma$ for each plot. For ease of presentation, the 3-dimensional states
$\mathbf{s} = (s_1, s_2, s_3)$ are compressed into 2-dimensional pixel coordinates $(u = s_2 - s_1, v = s_3 - s_1)$.
It can be shown that this does not take away information required to evaluate the potentials
or the degree at any pixel $(u, v)$. Further, only those states are considered whose compressed
coordinates $u, v$ lie in the range $[-T, T]$; in $T$ rounds, these account for all the reachable
states. The degrees are indicated on the plot by colors. Our discussion in the previous
sections implies that the possible values of the degree are 2 and 3. When the degree at a pixel
$(u, v)$ is 3, the pixel is colored green, and when the degree is 2, it is colored black.

Figure 3: Green pixels have degree 3, black pixels have degree 2. Each step is diagonally down (left),
and up (if x < y) and right (if x > y) and both when degree is 3. The rightmost figure uses $\gamma = 0.4$,
and the other two $\gamma = 0$. The loss function is 0-1.





Figure 4: Optimum recurrence value. We set $\gamma = 0$. Surface is irregular for smaller values of $T$, but
smoother for larger values, admitting hope for approximation.



Note that a random walk over the states $\mathbf{s}$, consisting of distributions over coordinate
steps $\{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$, translates to a random walk over the pixels $(u, v)$, where each
step lies in the set $\{(-1, -1), (1, 0), (0, 1)\}$. In Figure 3, these correspond to the directions
diagonally down, up or right. Therefore at a black pixel, the random walk either chooses
between diagonally down and up, or between diagonally down and right, with probabilities
$(1/2 + \gamma/2, 1/2 - \gamma/2)$. On the other hand, at a green pixel, the random walk chooses
among diagonally down, up and right with probabilities $(\gamma + (1-\gamma)/3, (1-\gamma)/3, (1-\gamma)/3)$.
The degree maps are shown for varying values of $T$ and the edge $\gamma$. While some patterns
emerge for the degrees, such as black or green depending on the parity of $u$ or $v$, the authors
found the region near the line $u = v$ still too complex to admit any solution apart from a
brute-force computation.
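
The correspondence between coordinate steps and pixel steps can be verified mechanically. The sketch below assumes the compression $u = s_2 - s_1$, $v = s_3 - s_1$, which is the one that yields the step set $\{(-1,-1), (1,0), (0,1)\}$ quoted above; the function names are illustrative only.

```python
def compress(s):
    """Map a 3-class state (s1, s2, s3) to pixel coordinates (u, v)."""
    s1, s2, s3 = s
    return (s2 - s1, s3 - s1)

def pixel_step(label):
    """Pixel displacement caused by one vote for `label` in {1, 2, 3}."""
    e = tuple(1 if l == label else 0 for l in (1, 2, 3))
    return compress(e)

def step_distribution(degree, gamma):
    """Probabilities of the moves available at a pixel of the given degree.

    The first entry is always the 'diagonally down' move (a vote for label 1);
    the remaining degree-1 entries are the other available moves.
    """
    return [gamma + (1.0 - gamma) / degree] + [(1.0 - gamma) / degree] * (degree - 1)

print([pixel_step(l) for l in (1, 2, 3)])
print(step_distribution(2, 0.4), step_distribution(3, 0.4))
```

With degree 2 the first probability equals $1/2 + \gamma/2$, matching the black-pixel case, and with degree 3 it equals $\gamma + (1-\gamma)/3$, matching the green-pixel case.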

We also plot the potential values themselves in Figure 4 against different states. In
each plot, the number of iterations remaining, $T$, is held constant, the number of classes is
chosen to be 3, and the edge $\gamma = 0$. The states are compressed into pixels as before, and the
potential is plotted against each pixel, resulting in a 3-dimensional surface. We include two
plots, with different values for $T$. The surface is irregular for $T = 3$ rounds, but smoother
for 20 rounds, admitting some hope for approximation.

Figure 5: Comparison of $\phi_t(\mathbf{0})$ (blue) with $\max_{\mathbf{q}} \phi^{\mathbf{q}}_t(\mathbf{0})$ (red) over different rounds $t$ (panel $k = 6$) and different
number of classes $k$ (panel $T = 10$). The same edge $\gamma$ is used in both.

An alternative approach would be to approximate the potential $\phi_t$ by the potential $\phi^{\mathbf{b}}_t$
for some fixed $\mathbf{b} \in \Delta^k_\gamma$ corresponding to some particular edge-over-random condition. Since
$\phi_t \ge \phi^{\mathbf{b}}_t$ for all edge-over-random distributions $\mathbf{b}$, it is natural to approximate by choosing
$\mathbf{b}$ that maximizes the fixed edge-over-random potential. (It can be shown that this $\mathbf{b}$
corresponds to the $\gamma$-biased uniform distribution.) Two plots comparing the potential
values at $\mathbf{0}$, $\phi_T(\mathbf{0})$ and $\max_{\mathbf{b}} \phi^{\mathbf{b}}_T(\mathbf{0})$, which correspond to the respective error upper bounds,
are shown in Figure 5. In the first plot, the number of classes $k$ is held fixed at 6, and the
values are plotted for different values of iterations $T$. In the second plot, the number of
classes varies, and the number of iterations is held at 10. Both plots show that the difference
in the values is significant, and hence $\max_{\mathbf{b}} \phi^{\mathbf{b}}_T(\mathbf{0})$ would be a rather optimistic upper bound
on the error when using the minimal weak learning condition.

If we use exponential loss (35), the situation is not much better. The degree maps
for varying values of the weight parameter $\eta$ against fixed values of edge $\gamma = 0.1$, rounds
remaining $T = 20$ and number of classes $k = 3$ are plotted in Figure 6. Although the
patterns are simple, with the degree 3 pixels forming a diagonal band, we found it hard to
prove this fact formally, or compute the exact boundary of the band. However the plots
suggest that when $\eta$ is small, all pixels have degree 3. We next find conditions under which
this opportunity for tractable computation exists.

Efficient computation in special cases. Here we show that when using the exponential
loss, if the edge $\gamma$ is very small, then the potentials can be computed efficiently. We first show
an intermediate result. We already observed empirically that when the weight parameter $\eta$
is small, the degrees all become equal to $k$. Indeed, we can prove this fact.

Figure 6: Green pixels have degree 3, black pixels have degree 2. Each step is diagonally down (left),
and up (if x < y) and right (if x > y) and both when degree is 3. Each plot uses $T = 20$, $\gamma = 0.1$.
The values of $\eta$ are 0.08, 0.1 and 0.3, respectively. With smaller values of $\eta$, more pixels have degree
3.



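Lemma 19 If the loss function being used is exponential loss (35) and the weight parameter
$\eta$ is small compared to the number of rounds,

$$\eta \le \frac{1}{4} \min\left\{ \frac{1}{k-1}, \frac{1}{T} \right\}, \tag{50}$$

then the optimal value of the degree $a$ in (48) is always $k$. Therefore, in this situation, the
potential $\phi_t$ using the minimal weak learning condition is the same as the potential $\phi^{\mathbf{u}}_t$ using
the $\gamma$-biased uniform distribution $\mathbf{u}$,

$$\mathbf{u} = \left( \frac{1-\gamma}{k} + \gamma, \frac{1-\gamma}{k}, \ldots, \frac{1-\gamma}{k} \right), \tag{51}$$

and hence can be efficiently computed.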



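Proof We show $\phi_t = \phi^{\mathbf{u}}_t$ by induction on the remaining number $t$ of boosting iterations.
The base case holds since, by definition, $\phi_0 = \phi^{\mathbf{u}}_0 = L^{\exp}_\eta$. Assume, inductively, that

$$\phi_{t-1}(\mathbf{s}) = \phi^{\mathbf{u}}_{t-1}(\mathbf{s}) = \kappa(\gamma, \eta)^{t-1} \sum_{l=2}^{k} e^{\eta(s_l - s_1)}, \tag{52}$$

where the second equality follows from (37). We show that

$$\phi_t(\mathbf{s}) = \mathbb{E}_{l \sim \mathbf{u}}\left[\phi_{t-1}(\mathbf{s} + \mathbf{e}_l)\right]. \tag{53}$$

By the inductive hypothesis and (30), the right hand side of (53) is in fact equal to $\phi^{\mathbf{u}}_t$, and
we will have shown $\phi_t = \phi^{\mathbf{u}}_t$. The proof will then follow by induction.

In order to show (53), by Lemma 18, it suffices to show that the optimal degree $a$
maximizing the right hand side of (48) is always $k$:

$$\mathbb{E}_{l \sim \mathbf{b}^\pi_a}\left[\phi_{t-1}(\mathbf{s} + \mathbf{e}_l)\right] \le \mathbb{E}_{l \sim \mathbf{b}^\pi_k}\left[\phi_{t-1}(\mathbf{s} + \mathbf{e}_l)\right]. \tag{54}$$

By (52), $\phi_{t-1}(\mathbf{s} + \mathbf{e}_{l_0})$ may be written as $\phi_{t-1}(\mathbf{s}) + \kappa(\gamma, \eta)^{t-1} \cdot \xi_{l_0}$, where the term $\xi_{l_0}$ is:

$$\xi_{l_0} = \begin{cases} (e^{\eta} - 1)\, e^{\eta(s_{l_0} - s_1)} & \text{if } l_0 \ne 1, \\ (e^{-\eta} - 1) \sum_{l=2}^{k} e^{\eta(s_l - s_1)} & \text{if } l_0 = 1. \end{cases}$$

Therefore (54) is the same as: $\mathbb{E}_{l \sim \mathbf{b}^\pi_a}[\xi_l] \le \mathbb{E}_{l \sim \mathbf{b}^\pi_k}[\xi_l]$. Assume (by relabeling if necessary)
that $\pi$ is the identity permutation on coordinates $\{2, \ldots, k\}$. Then the expression $\mathbb{E}_{l \sim \mathbf{b}^\pi_a}[\xi_l]$
can be written as

$$\mathbb{E}_{l \sim \mathbf{b}^\pi_a}[\xi_l] = \left( \frac{1-\gamma}{a} + \gamma \right) \xi_1 + \sum_{l=k-a+2}^{k} \frac{1-\gamma}{a}\, \xi_l = \gamma \xi_1 + (1-\gamma) \left\{ \frac{\xi_1 + \sum_{l=k-a+2}^{k} \xi_l}{a} \right\}.$$

It suffices to show that the term in curly brackets is maximized when $a = k$. We will in
fact show that the numerator of the term is negative if $a < k$, and non-negative for $a = k$,
which will complete our proof. Notice that the numerator can be written as

$$(e^{\eta} - 1) \left\{ \sum_{l=k-a+2}^{k} e^{\eta(s_l - s_1)} \right\} - (1 - e^{-\eta}) \sum_{l=2}^{k} e^{\eta(s_l - s_1)}$$

$$= (e^{\eta} - 1) \left\{ \sum_{l=k-a+2}^{k} e^{\eta(s_l - s_1)} - \sum_{l=2}^{k} e^{\eta(s_l - s_1)} \right\} + \left\{ (e^{\eta} - 1) - (1 - e^{-\eta}) \right\} \sum_{l=2}^{k} e^{\eta(s_l - s_1)}$$

$$= \left\{ e^{\eta} + e^{-\eta} - 2 \right\} \sum_{l=2}^{k} e^{\eta(s_l - s_1)} - (e^{\eta} - 1) \left\{ \sum_{l=2}^{k-a+1} e^{\eta(s_l - s_1)} \right\}.$$

When $a = k$, the second summation disappears, and we are left with a non-negative
expression. However when $a < k$, the second summation is at least $e^{\eta(s_2 - s_1)}$. Since $t \le T$,
and in $t$ iterations the absolute value of any state coordinate $|s_t(l)|$ is at most $T$, the first
summation is at most $(k-1) e^{2\eta T}$ and the second summation is at least $e^{-2\eta T}$. Therefore
the previous expression is at most

$$(k-1)\left(e^{\eta} + e^{-\eta} - 2\right) e^{2\eta T} - (e^{\eta} - 1)\, e^{-2\eta T} = (e^{\eta} - 1)\, e^{-2\eta T} \left\{ (k-1)(1 - e^{-\eta})\, e^{4\eta T} - 1 \right\}.$$

We show that the term in curly brackets is negative. Firstly, using $e^x \ge 1 + x$, we have
$1 - e^{-\eta} \le \eta \le 1/(4(k-1))$ by choice of $\eta$. Therefore it suffices to show that $e^{4\eta T} < 4$. By
choice of $\eta$ again, $e^{4\eta T} \le e < 4$. This completes our proof. ■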
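
The claim of Lemma 19 can be probed numerically using only the increments $\xi_l$ from the proof. The sketch below is an illustration under stated assumptions: states are taken with coordinates in $\{0, \ldots, T\}$, the bound on $\eta$ is the one required by the proof ($\eta \le 1/(4(k-1))$ and $\eta \le 1/(4T)$), and the names `xi` and `best_degree` are hypothetical.

```python
import math
from itertools import product

def xi(s, l0, eta):
    """Increment of sum_{l>=2} exp(eta*(s_l - s_1)) when label l0+1 gains a vote."""
    if l0 == 0:
        return (math.exp(-eta) - 1.0) * sum(
            math.exp(eta * (s[l] - s[0])) for l in range(1, len(s)))
    return (math.exp(eta) - 1.0) * math.exp(eta * (s[l0] - s[0]))

def best_degree(s, gamma, eta):
    """Degree a in {2, ..., k} maximizing the expected increment under b_a."""
    k = len(s)
    x = [xi(s, l, eta) for l in range(k)]
    rest = sorted(x[1:], reverse=True)      # labels 2..k, largest increment first
    running, best_a, best_value = x[0], None, float("-inf")
    for a in range(2, k + 1):
        running += rest[a - 2]
        value = gamma * x[0] + (1.0 - gamma) * running / a
        if value > best_value:
            best_a, best_value = a, value
    return best_a

k, T, gamma = 3, 20, 0.1
eta_small = 0.25 * min(1.0 / (k - 1), 1.0 / T)
degrees = {best_degree(s, gamma, eta_small)
           for s in product(range(0, T + 1, 5), repeat=k)}
print(degrees)                                # only the full degree k appears
print(best_degree((0, 5, 0), gamma, 1.0))     # a large eta can give a smaller degree
```

The second call uses a deliberately large $\eta$ to show that the conclusion can fail outside the regime covered by the lemma.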

The above lemma seems to suggest that under certain conditions, a sort of degeneracy
occurs, and the optimal Booster payoff (18) is nearly unaffected by whether we use the
minimal weak learning condition, or the condition $(\mathcal{C}^{\mathrm{eor}}, \mathbf{U}_\gamma)$. Indeed, we next prove this
fact.

Theorem 20 Suppose the loss function is as in Lemma 19, and for some parameter $\varepsilon > 0$,
the number of examples $m$ is large enough

m > . (55) 

e 

Consider the minimal weak learning condition (16), and the fixed edge-over-random
condition $(\mathcal{C}^{\mathrm{eor}}, \mathbf{U}_\gamma)$ corresponding to the $\gamma$-biased uniform baseline $\mathbf{U}_\gamma$. Then the optimal Booster
payoffs using either condition are within $\varepsilon$ of each other.

Proof We show that the OS strategies arising out of either condition are the same. In other
words, at any iteration $t$ and state $\mathbf{s}_t$, both strategies play the same cost matrix and enforce
the same constraints on the response of Weak-Learner. The theorem will then follow if we
can invoke Theorems 9 and 16. For that, the number of examples needs to be as large as 
in (28). The required largeness would follow from (55) if the loss function satisfied (27) 
with 0(L,T) at most exp(l/4). Since the largest discrepancy in losses between two states 
reachable in $T$ iterations is at most $e^{\eta T} - 0$, the bound follows from the choice of $\eta$ in (50).
Therefore, it suffices to show the equivalence of the OS strategies corresponding to the two 
weak learning conditions. 

We first show both strategies play the same cost-matrix. Lemma 19 states that the
potential function using the minimal weak learning condition is the same as when using the
fixed condition $(\mathcal{C}^{\mathrm{eor}}, \mathbf{U}_\gamma)$: $\phi_t = \phi^{\mathbf{u}}_t$, where $\mathbf{u}$ is as in (51). Since, according to (31) and (45),
given a state $\mathbf{s}_t$ and iteration $t$, the two strategies choose cost matrices that are identical
functions of the respective potentials, by the equivalence of the potential functions, the
resulting cost matrices must be the same.

Even with the same cost matrix, the two different conditions could be imposing different
constraints on Weak-Learner, which might affect the final payoff. For instance, with the
baseline $\mathbf{U}_\gamma$, Weak-Learner has to return a weak classifier $h$ satisfying

$$\mathbf{C}_t \bullet \mathbf{1}_h \le \mathbf{C}_t \bullet \mathbf{U}_\gamma,$$

whereas with the minimal condition, the constraint on $h$ is

$$\mathbf{C}_t \bullet \mathbf{1}_h \le \max_{\mathbf{B} \in \mathcal{B}^{\mathrm{eor}}_\gamma} \mathbf{C}_t \bullet \mathbf{B}.$$

In order to show that the constraints are the same we therefore need to show that for the
common cost matrix $\mathbf{C}_t$ chosen, the right hand sides of the two previous expressions are the
same:

$$\mathbf{C}_t \bullet \mathbf{U}_\gamma = \max_{\mathbf{B} \in \mathcal{B}^{\mathrm{eor}}_\gamma} \mathbf{C}_t \bullet \mathbf{B}. \tag{56}$$

We will in fact show the stronger fact that the equality holds for every row separately:

$$\forall i : \langle \mathbf{C}_t(i), \mathbf{u} \rangle = \max_{\mathbf{b} \in \Delta^k_\gamma} \langle \mathbf{C}_t(i), \mathbf{b} \rangle. \tag{57}$$

To see this, first observe that the choice of the optimal cost matrix $\mathbf{C}_t$ in (45) implies the
following identity

$$\langle \mathbf{C}_t(i), \mathbf{b} \rangle = \mathbb{E}_{l \sim \mathbf{b}}\left[\phi_{T-t-1}(\mathbf{s}_t(i) + \mathbf{e}_l)\right].$$

On the other hand, (48) and Lemma 19 together imply that the distribution $\mathbf{b}$ maximizing
the right hand side of the above is the $\gamma$-biased uniform distribution, from which (57)
follows. Therefore, the constraints placed on Weak-Learner by the cost-matrix $\mathbf{C}_t$ are the same
whether we use the minimal weak learning condition or the fixed condition $(\mathcal{C}^{\mathrm{eor}}, \mathbf{U}_\gamma)$. ■



One may wonder why $\eta$ would be chosen so small, especially since the previous theorem
indicates that such choices for $\eta$ lead to degeneracies. To understand this, recall that $\eta$
represents the size of the weights $\alpha_t$ chosen in every round, and was introduced as a tunable
parameter to help achieve the best possible upper bound on zero-one error. More precisely,
recall that the exponential loss $L^{\exp}_\eta(\mathbf{s})$ of the unweighted state, defined in (35), is equal
to the exponential loss $L^{\exp}(\mathbf{f})$ on the weighted state, defined in (36), which in turn is an
upper bound on the error $L^{\mathrm{err}}(\mathbf{f}_T)$ of the final weighted state $\mathbf{f}_T$. Therefore the potential
value $\phi_T(\mathbf{0})$ based on the exponential loss $L^{\exp}_\eta$ is an upper bound on the minimum error
attainable after $T$ rounds of boosting. At the same time, $\phi_T(\mathbf{0})$ is a function of $\eta$. Therefore,
we may tune this parameter to attain the best bound possible. Even with this motivation,
it may seem that a properly tuned $\eta$ will not be as small as in Lemma 19, especially since
it can be shown that the resulting loss bound $\phi_T(\mathbf{0})$ will always be larger than a fixed
constant (depending on $\gamma$, $k$), no matter how many rounds $T$ of boosting are used. However,
the next result identifies conditions under which the tuned value of $\eta$ is indeed as small
as in Lemma 19. This happens when the edge $\gamma$ is very small, as is described in the next
theorem. Intuitively, a weak classifier achieving small edge has low accuracy, and a low
weight reflects Booster's lack of confidence in this classifier.

Theorem 21 When using the exponential loss function (35), and the minimal weak learning
condition (16), the loss upper bound $\phi_T(\mathbf{0})$ provided by Theorem 16 is more than 1 and
hence trivial unless the parameter $\eta$ is chosen sufficiently small compared to the edge $\gamma$:

$$\eta \le \frac{k\gamma}{1-\gamma}. \tag{58}$$

In particular, when the edge is very small:

$$\gamma \le \min\left\{ \frac{1}{2}, \frac{1}{8k} \min\left\{ \frac{1}{k}, \frac{1}{T} \right\} \right\}, \tag{59}$$

the value of $\eta$ needs to be as small as in (50).

Proof Comparing solutions (49) and (34) to the potentials corresponding to the minimal
weak learning condition and a fixed edge-over-random condition, we may conclude that
the loss bound $\phi_T(\mathbf{0})$ in the former case is larger than $\phi^{\mathbf{b}}_T(\mathbf{0})$, for any edge-over-random
distribution $\mathbf{b} \in \Delta^k_\gamma$. In particular, when $\mathbf{b}$ is set to be the $\gamma$-biased uniform distribution
$\mathbf{u}$, as defined in (51), we get $\phi_T(\mathbf{0}) \ge \phi^{\mathbf{u}}_T(\mathbf{0})$. Now the latter bound, according to (37), is
$\kappa(\gamma, \eta)^T (k-1)$, where $\kappa$ is defined as in (38). Therefore, to get non-trivial loss bounds which are
at most 1, we need to choose $\eta$ such that $\kappa(\gamma, \eta) < 1$. By (38), this happens when

$$(1 - e^{-\eta})\gamma > \left(e^{\eta} + e^{-\eta} - 2\right) \left( \frac{1-\gamma}{k} \right), \quad \text{i.e.,} \quad \frac{k\gamma}{1-\gamma} > \frac{e^{\eta} + e^{-\eta} - 2}{1 - e^{-\eta}} = e^{\eta} - 1 \ge \eta.$$

Therefore (58) holds. When $\gamma$ is as small as in (59), then $1 - \gamma \ge \frac{1}{2}$, and therefore, by (58),
$\eta \le 2k\gamma \le \frac{1}{4} \min\{1/k, 1/T\}$, so that $\eta$ is as small as required in (50). ■
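
The threshold on $\eta$ in the proof above is easy to evaluate numerically. In the sketch below, the expression used for $\kappa(\gamma, \eta)$ is an assumption: it is the per-round factor obtained by averaging the increments $\xi_l$ from the proof of Lemma 19 under the $\gamma$-biased uniform distribution, and is presumed to agree with definition (38), which is not reproduced in this section. The function names are hypothetical.

```python
import math

def kappa(gamma, eta, k):
    """Assumed per-round factor of the exponential-loss potential under u."""
    return (1.0
            + (1.0 - gamma) / k * (math.exp(eta) + math.exp(-eta) - 2.0)
            - (1.0 - math.exp(-eta)) * gamma)

def eta_threshold(gamma, k):
    """Largest eta with kappa <= 1, i.e. exp(eta) - 1 = k*gamma / (1 - gamma)."""
    return math.log1p(k * gamma / (1.0 - gamma))

gamma, k = 0.1, 3
thr = eta_threshold(gamma, k)
print(thr, k * gamma / (1.0 - gamma))      # the threshold sits below k*gamma/(1-gamma)
print(kappa(gamma, 0.2, k), kappa(gamma, 0.4, k))
```

Values of $\eta$ below the threshold give a factor smaller than 1, so the bound $\kappa^T (k-1)$ eventually becomes non-trivial, while values above it make the bound grow with $T$.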

The condition in the previous theorem, that of the existence of only a very small edge, 
is the most we can assume for most practical datasets. Therefore, in such situations, we 
can compute the optimal Booster strategy that uses the minimal weak learning conditions. 
More importantly, using this result, we derive, in the next section, a highly efficient and 
practical adaptive algorithm, that is, one that does not require any prior knowledge about 
the edge $\gamma$, and will therefore work with any dataset.

8. Variable edges 

So far we have required Weak-Learner to beat random by at least a fixed amount $\gamma > 0$ in
each round of the boosting game. In reality, the edge over random is larger initially, and gets
smaller as the OS algorithm creates harder cost matrices. Therefore requiring a fixed edge
is either unduly pessimistic or overly optimistic. If the fixed edge is too small, not enough
progress is made in the initial rounds, and if the edge is too large, Weak-Learner fails to meet
the weak-learning condition in later rounds. We fix this by not making any assumption
about the edges, but instead adaptively responding to the edges returned by Weak-Learner. 
In the rest of the section we describe the adaptive procedure, and the resulting loss bounds 
guaranteed by it. 

The philosophy behind the adaptive algorithm is a boosting game where Booster and 
Weak Learner no longer have opposite goals, but cooperate to reduce error as fast as possible. 
However, in order to create a clean abstraction and separate implementations of the boosting 
algorithms and the weak learning procedures as much as possible, we assume neither of the 
players has any knowledge of the details of the algorithm employed by the other player. In 
particular Booster may only assume that Weak Learner's strategy is barely strong enough to 
guarantee boosting. Therefore, Booster's demands on the weak classifiers returned by Weak 
Learner should be minimal, and it should send the weak learning algorithm the "easiest" 
cost matrices that will ensure boostability. In turn, Weak Learner may only assume a
very weak Booster strategy, and therefore return a weak classifier that performs as well as 
possible with respect to the cost matrix sent by Booster. 

At a high level, the adaptive strategy proceeds as follows. At any iteration, based on the 
states of the examples and number of remaining rounds of boosting, Booster chooses the 
game-theoretically optimal cost matrix assuming only infinitesimal edges in the remaining 
rounds. Intuitively, Booster has no high expectations of Weak Learner, and supplies it 
the easiest cost matrices with which it may be able to boost. However, in the adaptive 
setting, Weak-Learner is no longer adversarial. Therefore, although only infinitesimal edges 
are anticipated by Booster, Weak Learner cooperates in returning weak classifiers that 
achieve as large edges as possible, which will be more than just infinitesimal. Based on
the exact edge received in each round, Booster chooses the weight $\alpha_t$ adaptively to reach
the most favourable state possible. Therefore, Booster plays game theoretically assuming 
an adversarial Weak Learner and expecting only the smallest edges in the future rounds, 
although Weak Learner actually cooperates, and Booster adaptively exploits this favorable 
behavior as much as possible. This way the boosting algorithm remains robust to a poorly
performing Weak Learner, and yet can make use of a powerful weak learning algorithm
whenever possible. 

We next describe the details of the adaptive procedure. With variable weights we need
to work with the weighted state $\mathbf{f}_t(i)$ of each example $i$, defined in (22). To keep the
computations tractable, we will only be working with the exponential loss $L^{\exp}(\mathbf{f})$ on the
weighted states. We first describe how Booster chooses the cost-matrix in each round.
Following that we describe how it adaptively computes the weights in each round based on 
the edge of the weak classifier received. 

Choosing the cost-matrix. As discussed before, at any iteration $t$ and state $f_{t-1}$ Booster
assumes that it will receive an infinitesimal edge $\gamma$ in each of the remaining rounds. Since
the step size is a function of the edge, which in turn is expected to be the same tiny value
in each round, we may assume that the step size in each round will also be some fixed value
$\eta$. We are therefore in the setting of Theorem 21, which states that the parameter $\eta$ in
the exponential loss function (35) should also be tiny to get any non-trivial bound. But
then the loss function satisfies the conditions in Lemma 19, and by Theorem 20, the
game-theoretically optimal strategy remains the same whether we use the minimal condition
or $(\mathcal{C}^{\mathrm{eor}}, U_\gamma)$. When using the latter condition, the optimal choice of the cost-matrix at
iteration $t$ and state $f_{t-1}$, according to (39), is

\[
C_t(i,l) =
\begin{cases}
(e^{\eta} - 1)\, e^{f_{t-1}(i,l) - f_{t-1}(i,1)} & \text{if } l > 1, \\
(e^{-\eta} - 1) \sum_{j=2}^{k} e^{f_{t-1}(i,j) - f_{t-1}(i,1)} & \text{if } l = 1.
\end{cases}
\tag{60}
\]

Further, when using the condition $(\mathcal{C}^{\mathrm{eor}}, U_\gamma)$, the average potential of the states $f_t(i)$,
according to (37), is given by the average loss (40) of the state times $\kappa(\gamma,\eta)^{T-t}$, where the
function $\kappa$ is defined in (38). Our goal is to choose $\eta$ as a function of $\gamma$ so that $\kappa(\gamma,\eta)$ is
as small as possible. Now, there is no lower bound on how small the edge $\gamma$ may get, and,
anticipating the worst, it makes sense to choose an infinitesimal $\gamma$, in the spirit of (Freund,
2001). Eq. (38) then implies that the choice of $\eta$ should also be infinitesimal. Then the
above choice of the cost matrix becomes the following (after some rescaling):

\[
C_t(i,l) =
\begin{cases}
e^{f_{t-1}(i,l) - f_{t-1}(i,1)} & \text{if } l > 1, \\
-\sum_{j=2}^{k} e^{f_{t-1}(i,j) - f_{t-1}(i,1)} & \text{if } l = 1.
\end{cases}
\tag{61}
\]

We have therefore derived the optimal cost matrix played by the adaptive boosting strategy,
and we record this fact.

Lemma 22 Consider the boosting game using the minimal weak learning condition (16).
Then, in iteration $t$ at state $f_{t-1}$, the game-theoretically optimal Booster strategy chooses the
cost matrix $C_t$ given in (61).

We next show how to adaptively choose the weights $\alpha_t$.

Adaptively choosing weights. Once Weak Learner returns a weak classifier $h_t$, Booster
chooses the optimum weight $\alpha_t$ so that the resulting states $f_t = f_{t-1} + \alpha_t \mathbf{1}_{h_t}$ are as favorable
as possible, that is, minimizes the total potential of its states. By our previous discussions,
these are proportional to the total loss given by $Z_t = \sum_{i=1}^{m} \sum_{l=2}^{k} e^{f_t(i,l) - f_t(i,1)}$. For any
choice of $\alpha_t$, the difference $Z_t - Z_{t-1}$ between the total loss in rounds $t-1$ and $t$ is given by

\begin{align*}
Z_t - Z_{t-1}
&= (e^{\alpha_t} - 1) \sum_{i \in S_-} e^{f_{t-1}(i, h_t(x_i)) - f_{t-1}(i,1)} - (1 - e^{-\alpha_t}) \sum_{i \in S_+} L^{\exp}(f_{t-1}(i)) \\
&= (e^{\alpha_t} - 1) A^t_- - (1 - e^{-\alpha_t}) A^t_+ \\
&= \left(A^t_+ e^{-\alpha_t} + A^t_- e^{\alpha_t}\right) - \left(A^t_+ + A^t_-\right),
\end{align*}

where $S_+$ denotes the set of examples that $h_t$ classifies correctly, $S_-$ the incorrectly classified
examples, and $A^t_-$, $A^t_+$ denote the first and second summations, respectively. Therefore, the
task of choosing $\alpha_t$ can be cast as a simple optimization problem minimizing the previous
expression. In fact, the optimal value of $\alpha_t$ is given by the following closed form expression

\[
\alpha_t = \frac{1}{2} \ln\left(\frac{A^t_+}{A^t_-}\right). \tag{62}
\]
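The quantities above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the convention of this section that the correct label is label 1 (index 0 in the code), and the helper names `cost_matrix`, `total_loss`, `split_loss`, `update` and the toy state matrix are hypothetical choices made for illustration.

```python
import math

def cost_matrix(f):
    """Cost matrix of (61); the correct label is assumed to be index 0."""
    C = []
    for row in f:
        wrong = [math.exp(row[l] - row[0]) for l in range(1, len(row))]
        C.append([-sum(wrong)] + wrong)
    return C

def total_loss(f):
    """Z: sum over examples and incorrect labels of exp(f(i,l) - f(i,1))."""
    return sum(math.exp(row[l] - row[0])
               for row in f for l in range(1, len(row)))

def split_loss(f, preds):
    """Return (A_plus, A_minus) for predictions preds[i] in {0, ..., k-1}."""
    a_plus = a_minus = 0.0
    for row, h in zip(f, preds):
        if h == 0:   # correct prediction: the example's full loss
            a_plus += sum(math.exp(row[l] - row[0]) for l in range(1, len(row)))
        else:        # incorrect prediction: only the predicted label's term
            a_minus += math.exp(row[h] - row[0])
    return a_plus, a_minus

def update(f, preds, alpha):
    """State update f_t = f_{t-1} + alpha * 1_{h_t}."""
    return [[v + (alpha if l == h else 0.0) for l, v in enumerate(row)]
            for row, h in zip(f, preds)]

# Toy state with 4 examples and 3 labels (hypothetical numbers).
f_prev = [[0.5, 0.0, 0.2], [0.0, 0.3, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.4]]
preds = [0, 1, 0, 0]   # the weak classifier errs only on the second example

a_plus, a_minus = split_loss(f_prev, preds)
alpha_opt = 0.5 * math.log(a_plus / a_minus)            # equation (62)
z_opt = total_loss(update(f_prev, preds, alpha_opt))

# A coarse grid search over alpha should not beat the closed form.
z_grid = min(total_loss(update(f_prev, preds, a / 100.0)) for a in range(1, 300))
```

In this sketch each row of the cost matrix sums to zero, and the cost of the weak classifier, $C_t \bullet \mathbf{1}_{h_t}$, equals $A^t_- - A^t_+$, which is the identity used later when computing the edge.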

With this choice of weight, one can show (with some straightforward algebra) that the total
loss of the state falls by a factor less than 1. In fact the factor is exactly

\[
(1 - c_t) + \sqrt{c_t^2 - \delta_t^2}, \tag{63}
\]

where

\[
c_t = (A^t_+ + A^t_-)/Z_{t-1}, \tag{64}
\]

and $\delta_t$ is the edge of the returned classifier $h_t$ on the supplied cost-matrix $C_t$. Notice that
the quantity $c_t$ is at most 1, and hence the factor (63) can be upper bounded by $\sqrt{1 - \delta_t^2}$.
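For completeness, one way to carry out the algebra is sketched below. It uses only (62), (64) and the relation $\delta_t Z_{t-1} = A^t_+ - A^t_-$, which follows from (65) because $C_t \bullet \mathbf{1}_{h_t} = A^t_- - A^t_+$.

```latex
\begin{align*}
\frac{Z_t}{Z_{t-1}}
 &= 1 + \frac{A^t_+ e^{-\alpha_t} + A^t_- e^{\alpha_t} - (A^t_+ + A^t_-)}{Z_{t-1}} \\
 &= 1 - c_t + \frac{2\sqrt{A^t_+ A^t_-}}{Z_{t-1}}
    && \text{since } e^{\alpha_t} = \sqrt{A^t_+ / A^t_-} \\
 &= 1 - c_t + \frac{\sqrt{(A^t_+ + A^t_-)^2 - (A^t_+ - A^t_-)^2}}{Z_{t-1}} \\
 &= (1 - c_t) + \sqrt{c_t^2 - \delta_t^2}.
\end{align*}
```

The last expression is non-decreasing in $c_t$ on $[\delta_t, 1]$, since its derivative $-1 + c_t/\sqrt{c_t^2 - \delta_t^2}$ is non-negative, so its largest value is attained at $c_t = 1$ and equals $\sqrt{1 - \delta_t^2}$.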

We next show how to compute the edge $\delta_t$. The definition of the edge depends on the weak
learning condition being used, and in this case we are using the minimal condition (16).
Therefore the edge $\delta_t$ is the largest $\gamma$ such that the following still holds

\[
C_t \bullet \mathbf{1}_{h_t} \le \max_{B \in \mathcal{B}^{\mathrm{eor}}_\gamma} C_t \bullet B.
\]

However, since $C_t$ is the optimal cost matrix when using exponential loss with a tiny value
of $\eta$, we can use arguments in the proof of Theorem 20 to simplify the computation. In
particular, eq. (56) implies that the edge $\delta_t$ may be computed as the largest $\gamma$ satisfying
the following simpler inequality

\begin{align*}
\delta_t &= \sup\left\{\gamma : C_t \bullet \mathbf{1}_{h_t} \le C_t \bullet U_\gamma\right\} \\
&= \sup\left\{\gamma : C_t \bullet \mathbf{1}_{h_t} \le -\gamma \sum_{i=1}^{m} \sum_{l=2}^{k} e^{f_{t-1}(i,l) - f_{t-1}(i,1)}\right\} \\
\implies \delta_t &= \gamma : C_t \bullet \mathbf{1}_{h_t} = -\gamma \sum_{i=1}^{m} \sum_{l=2}^{k} e^{f_{t-1}(i,l) - f_{t-1}(i,1)} \\
\implies \delta_t &= \frac{-C_t \bullet \mathbf{1}_{h_t}}{\sum_{i=1}^{m} \sum_{l=2}^{k} e^{f_{t-1}(i,l) - f_{t-1}(i,1)}} = \frac{-C_t \bullet \mathbf{1}_{h_t}}{Z_{t-1}}, \tag{65}
\end{align*}

where the first step follows by expanding $C_t \bullet U_\gamma$. We have therefore an adaptive strategy
which efficiently reduces error. We record our results.

Lemma 23 If the weight $\alpha_t$ in each round is chosen as in (62), and the edge $\delta_t$ is given by
(65), then the total loss $Z_t$ falls by the factor given in (63), which is at most $\sqrt{1 - \delta_t^2}$.
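Lemma 23 can be sanity-checked on concrete numbers. The sketch below is only an illustration: the function name `lemma23_check` and the input values are hypothetical, and the inputs are assumed to satisfy $A^t_+ > A^t_- > 0$ and $A^t_+ + A^t_- \le Z_{t-1}$.

```python
import math

def lemma23_check(a_plus, a_minus, z_prev):
    """Return (actual ratio Z_t / Z_{t-1}, factor (63), bound sqrt(1 - delta^2))."""
    alpha = 0.5 * math.log(a_plus / a_minus)                  # weight (62)
    z_new = (z_prev
             + a_plus * (math.exp(-alpha) - 1.0)
             + a_minus * (math.exp(alpha) - 1.0))
    c = (a_plus + a_minus) / z_prev                           # quantity (64)
    delta = (a_plus - a_minus) / z_prev                       # edge (65)
    factor = (1.0 - c) + math.sqrt(c * c - delta * delta)     # factor (63)
    return z_new / z_prev, factor, math.sqrt(1.0 - delta * delta)

ratio, factor, bound = lemma23_check(a_plus=4.0, a_minus=1.0, z_prev=6.0)
```

With these inputs the loss goes from 6 to 5, so the ratio and the factor (63) both equal 5/6, which is below the bound $\sqrt{1 - 0.5^2} \approx 0.866$.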

The choice of $\alpha_t$ in (62) is optimal, but depends on quantities other than just the edge
$\delta_t$. We next show a way of choosing $\alpha_t$ based only on $\delta_t$ that still causes the total loss to
drop by a factor of $\sqrt{1 - \delta_t^2}$.

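Lemma 24 Suppose cost matrix $C_t$ is chosen as in (61), and the returned weak classifier
$h_t$ has edge $\delta_t$, i.e., $C_t \bullet \mathbf{1}_{h_t} \le C_t \bullet U_{\delta_t}$. Then choosing any weight $\alpha_t > 0$ for $h_t$ makes the
loss $Z_t$ at most a factor

\[
1 - \frac{1}{2}\left(e^{\alpha_t} - e^{-\alpha_t}\right)\delta_t + \frac{1}{2}\left(e^{\alpha_t} + e^{-\alpha_t} - 2\right)
\]

of the previous loss $Z_{t-1}$. In particular by choosing

\[
\alpha_t = \frac{1}{2} \ln\left(\frac{1 + \delta_t}{1 - \delta_t}\right), \tag{66}
\]

the drop factor is at most $\sqrt{1 - \delta_t^2}$.

Proof We borrow notation from earlier discussions. The edge-condition implies

\[
A^t_- - A^t_+ = C_t \bullet \mathbf{1}_{h_t} \le C_t \bullet U_{\delta_t} = -\delta_t Z_{t-1} \implies A^t_+ - A^t_- \ge \delta_t Z_{t-1}.
\]

On the other hand, the drop in loss after choosing $h_t$ with weight $\alpha_t$ is

\[
\left(1 - e^{-\alpha_t}\right) A^t_+ - \left(e^{\alpha_t} - 1\right) A^t_-
= \left(\frac{e^{\alpha_t} - e^{-\alpha_t}}{2}\right)\left(A^t_+ - A^t_-\right) - \left(\frac{e^{\alpha_t} + e^{-\alpha_t} - 2}{2}\right)\left(A^t_+ + A^t_-\right).
\]

We have already shown that $A^t_+ - A^t_- \ge \delta_t Z_{t-1}$. Further, $A^t_+ + A^t_-$ is at most $Z_{t-1}$.
Therefore the loss $Z_t$ is at most the following factor of $Z_{t-1}$:

\[
1 - \frac{1}{2}\left(e^{\alpha_t} - e^{-\alpha_t}\right)\delta_t + \frac{1}{2}\left(e^{\alpha_t} + e^{-\alpha_t} - 2\right) = \frac{1}{2}\left\{(1 - \delta_t)e^{\alpha_t} + (1 + \delta_t)e^{-\alpha_t}\right\}.
\]

Tuning $\alpha_t$ as in (66) makes this factor equal to $\sqrt{1 - \delta_t^2}$. ∎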
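The last step of the proof can be verified numerically. The sketch below is illustrative only; `drop_bound` and `approx_weight` are hypothetical names for the bound in Lemma 24 and the weight in (66), and the edge values are arbitrary.

```python
import math

def drop_bound(alpha, delta):
    """Bound of Lemma 24 on Z_t / Z_{t-1} for weight alpha and edge delta."""
    return 0.5 * ((1.0 - delta) * math.exp(alpha)
                  + (1.0 + delta) * math.exp(-alpha))

def approx_weight(delta):
    """Weight of equation (66); requires 0 <= delta < 1."""
    return 0.5 * math.log((1.0 + delta) / (1.0 - delta))

# For each edge, record the tuned weight and the resulting bound.
results = [(d, approx_weight(d), drop_bound(approx_weight(d), d))
           for d in (0.05, 0.3, 0.9)]
```

For each edge in the list the recorded bound coincides with $\sqrt{1 - \delta_t^2}$ up to rounding, and perturbing the weight in either direction can only increase the bound, because the bound is convex in $\alpha_t$.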

Algorithm 1 contains pseudocode for the adaptive algorithm, and includes both ways of
choosing $\alpha_t$. We call both versions of this algorithm AdaBoost.MM. With the approximate
way of choosing the step length in (67), AdaBoost.MM turns out to be identical to
AdaBoost.M2 (Freund and Schapire, 1997) or AdaBoost.MR (Schapire and Singer, 1999),
provided the weak classifier space is transformed in an appropriate way to be acceptable by
AdaBoost.M2 or AdaBoost.MR. We emphasize that AdaBoost.MM and AdaBoost.M2 are
products of very different theoretical considerations, and this similarity should be viewed
as a coincidence arising because of the particular choice of loss function, infinitesimal edge
and approximate step size. For instance, when the step sizes are chosen instead as in (68),
the training error falls more rapidly, and the resulting algorithm is different.

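Algorithm 1 AdaBoost.MM

Require: Number of classes $k$, number of examples $m$.
Require: Training set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ with $y_i \in \{1, \ldots, k\}$ and $x_i \in \mathcal{X}$.

• Initialize $m \times k$ matrix $f_0(i,l) = 0$ for $i = 1, \ldots, m$, and $l = 1, \ldots, k$.
for $t = 1$ to $T$ do
  • Choose cost matrix $C_t$ as follows:
    \[
    C_t(i,l) =
    \begin{cases}
    e^{f_{t-1}(i,l) - f_{t-1}(i,y_i)} & \text{if } l \ne y_i, \\
    -\sum_{j \ne y_i} e^{f_{t-1}(i,j) - f_{t-1}(i,y_i)} & \text{if } l = y_i.
    \end{cases}
    \]
  • Receive weak classifier $h_t : \mathcal{X} \to \{1, \ldots, k\}$ from weak learning algorithm.
  • Compute edge $\delta_t$ as follows:
    \[
    \delta_t = \frac{-\sum_{i=1}^{m} C_t(i, h_t(x_i))}{\sum_{i=1}^{m} \sum_{l \ne y_i} e^{f_{t-1}(i,l) - f_{t-1}(i,y_i)}}
    \]
  • Choose $\alpha_t$ either as
    \[
    \alpha_t = \frac{1}{2} \ln\left(\frac{1 + \delta_t}{1 - \delta_t}\right) \tag{67}
    \]
    or, for a slightly bigger drop in the loss, as
    \[
    \alpha_t = \frac{1}{2} \ln\left(\frac{\sum_{i : h_t(x_i) = y_i} \sum_{l \ne y_i} e^{f_{t-1}(i,l) - f_{t-1}(i,y_i)}}{\sum_{i : h_t(x_i) \ne y_i} e^{f_{t-1}(i,h_t(x_i)) - f_{t-1}(i,y_i)}}\right) \tag{68}
    \]
  • Compute $f_t$ as:
    \[
    f_t(i,l) = f_{t-1}(i,l) + \alpha_t \mathbf{1}[h_t(x_i) = l].
    \]
end for
• Output weighted combination of weak classifiers $F_T : \mathcal{X} \times \{1, \ldots, k\} \to \mathbb{R}$ defined as:
    \[
    F_T(x,l) = \sum_{t=1}^{T} \alpha_t \mathbf{1}[h_t(x) = l]. \tag{69}
    \]
• Based on $F_T$, output a classifier $H_T : \mathcal{X} \to \{1, \ldots, k\}$ that predicts as
    \[
    H_T(x) = \operatorname*{argmax}_{l = 1, \ldots, k} F_T(x,l). \tag{70}
    \]

As a summary of all the discussions in the section, we record the following theorem.

Theorem 25 The boosting algorithm AdaBoost.MM, shown in Algorithm 1, is the optimal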
strategy for playing the adaptive boosting game, and is based on the minimal weak learning
condition. Further if the edges returned in each round are $\delta_1, \ldots, \delta_T$, then the error after $T$
rounds is at most $(k-1) \prod_{t=1}^{T} \sqrt{1 - \delta_t^2} \le (k-1) \exp\left\{-(1/2) \sum_{t=1}^{T} \delta_t^2\right\}$.

In particular, if a weak hypothesis space is used that satisfies the optimal weak learning
condition (16), for some $\gamma$, then the edge in each round is large, $\delta_t \ge \gamma$, and therefore the
error after $T$ rounds is exponentially small, $(k-1)e^{-T\gamma^2/2}$.
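To make the algorithm and the bound of Theorem 25 concrete, here is a minimal Python sketch of Algorithm 1 with the approximate step size (67). It is an illustration under stated assumptions rather than a reference implementation: labels are 0-indexed, the weak classifier space is a small finite list searched exhaustively (a fully cooperative weak learner), and the names `adaboost_mm`, `predict`, `make_stump` and the three-point toy dataset are hypothetical.

```python
import math

def adaboost_mm(X, y, k, hypotheses, T):
    """Sketch of Algorithm 1 with the approximate step size.

    Labels are 0..k-1. `hypotheses` is a finite list of functions x -> label;
    the weak learner is assumed to return the minimum-cost one each round.
    """
    m = len(X)
    f = [[0.0] * k for _ in range(m)]             # f_0 = 0
    ensemble, edges = [], []
    for _ in range(T):
        # Cost matrix, written relative to each example's own label y[i].
        C = []
        for i in range(m):
            row = [math.exp(f[i][l] - f[i][y[i]]) if l != y[i] else 0.0
                   for l in range(k)]
            row[y[i]] = -sum(row)
            C.append(row)
        Z = -sum(C[i][y[i]] for i in range(m))    # total loss Z_{t-1}
        h = min(hypotheses,
                key=lambda g: sum(C[i][g(X[i])] for i in range(m)))
        preds = [h(x) for x in X]
        delta = -sum(C[i][preds[i]] for i in range(m)) / Z   # edge
        if delta <= 0.0:
            break                                 # weak learner found no edge
        delta = min(delta, 1.0 - 1e-12)           # guard against log(1/0)
        alpha = 0.5 * math.log((1.0 + delta) / (1.0 - delta))
        for i in range(m):
            f[i][preds[i]] += alpha               # f_t = f_{t-1} + alpha * 1_{h_t}
        ensemble.append((alpha, h))
        edges.append(delta)
    return ensemble, edges, f

def predict(ensemble, k, x):
    """Final classifier: argmax over labels of the weighted vote."""
    scores = [0.0] * k
    for alpha, h in ensemble:
        scores[h(x)] += alpha
    return max(range(k), key=lambda l: scores[l])

def make_stump(theta, a, b):
    """Decision stump on the real line: label a below theta, label b otherwise."""
    return lambda x: a if x < theta else b

# Toy problem: three classes on a line; no single stump separates all of them.
X = [0.5, 1.5, 2.5]
y = [0, 1, 2]
k = 3
stumps = [make_stump(t, a, b)
          for t in (1.0, 2.0) for a in range(k) for b in range(k)]

ensemble, edges, f = adaboost_mm(X, y, k, stumps, T=20)
train_err = sum(predict(ensemble, k, x) != yi for x, yi in zip(X, y)) / len(X)
avg_loss = sum(math.exp(f[i][l] - f[i][y[i]])
               for i in range(len(X)) for l in range(k) if l != y[i]) / len(X)
bound = (k - 1) * math.prod(math.sqrt(1.0 - d * d) for d in edges)
```

In this sketch the training error is bounded by the average exponential loss, which in turn is bounded by $(k-1)\prod_t \sqrt{1 - \delta_t^2}$, matching the chain of inequalities behind Theorem 25. The exhaustive search is chosen only for brevity; any weak learner returning a classifier with positive edge on $C_t$ could be substituted.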

The theorem above states that as long as the minimal weak learning condition is
satisfied, the error will decrease exponentially fast. Even if the condition is not satisfied, the
error rate will keep falling rapidly provided the edges achieved by the weak classifiers are
relatively high. However, our theory so far can provide no guarantees on these edges, and
therefore it is not clear what is the best error rate achievable in this case, and how quickly it
is achieved. The assumption of boostability, and hence our minimal weak learning
condition, does not hold for the vast majority of practical datasets, and as such it is important to
know what happens in such settings. In particular, an important requirement is empirical
consistency, where we want that for any given weak classifier space, the algorithm converge,
if allowed to run forever, to the weighted combination of classifiers that minimizes error on
the training set. Another important criterion is universal consistency, which requires that
the algorithm converge, when provided sufficient training data, to the classifier combination
that minimizes error on the test dataset. In the next section, we show that AdaBoost.MM
satisfies such consistency requirements. Both the choice of the minimal weak learning
condition as well as the setup of the adaptive game framework will play crucial roles in ensuring
consistency. These results therefore provide evidence that game theoretic considerations
can have strong statistical implications.

9. Consistency of the adaptive algorithm 

The goal in a classification task is to design a classifier that predicts with high accuracy on 
unobserved or test data. This is usually carried out by ensuring the classifier fits training 
data well without being overly complex. Assuming the training and test data are reasonably 
similar, one can show that the above procedure achieves high test accuracy, or is consistent. 
Here we work in a probabilistic setting that connects training and test data by assuming 
both consist of examples and labels drawn from a common, unknown distribution. 

Consistency for multiclass classification in the probabilistic setting has been studied by
Tewari and Bartlett (2007), who show that, unlike in the binary setting, many natural
approaches fail to achieve consistency. In this section, we show that AdaBoost.MM described
in the previous section avoids such pitfalls and enjoys various consistency results. We begin
by laying down some standard assumptions and setting up some notation. Then we prove
our first result showing that our algorithm minimizes a certain exponential loss function
on the training data at a fast rate. Next, we build upon this result and improve along two
fronts: firstly we change our metric from exponential loss to the more relevant classification
error metric, and secondly we show fast convergence on not just training data, but also the
test set. For the proofs, we heavily reuse existing machinery in the literature.

Throughout the rest of this section we consider the version of AdaBoost.MM that picks
weights according to the approximate rule in (67). All our results most probably hold 
with the other rule for picking weights in (68) as well, but we did not verify that. These 
results hold without any boostability requirements on the space H of weak classifiers, and 
are therefore widely applicable in practice. While we do not assume any weak learning 
condition, we will require a fully cooperating Weak Learner. In particular, we will require 
that in each round Weak Learner picks the weak classifier suffering minimum cost with 
respect to the cost matrix provided by the boosting algorithm, or equivalently achieves the 
highest edge as defined in (65). Such assumptions are both necessary and standard in the 
literature, and are frequently met in practice. 

In order to state our results, we will need to set up some notation. The space of examples
will be denoted by $\mathcal{X}$, and the set of labels by $\mathcal{Y} = \{1, \ldots, k\}$. We also fix a finite weak
classifier space $\mathcal{H}$ consisting of classifiers $h : \mathcal{X} \to \mathcal{Y}$. We will be interested in functions
$F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that assign a score to every example and label pair. Important examples
of such functions are the weighted majority combinations (69) output by the adaptive
algorithm. In general, any such combination of the weak classifiers in space $\mathcal{H}$ is specified
by some weight function $\alpha : \mathcal{H} \to \mathbb{R}$; the resulting function is denoted by $F_\alpha : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$,
and satisfies:

\[
F_\alpha(x,l) = \sum_{h \in \mathcal{H}} \alpha(h) \mathbf{1}[h(x) = l].
\]

We will be interested in measuring the average exponential loss of such functions. To
measure this, we introduce the $\widehat{\mathrm{risk}}$ operator:

\[
\widehat{\mathrm{risk}}(F) = \frac{1}{m} \sum_{i=1}^{m} \sum_{l \ne y_i} e^{F(x_i,l) - F(x_i,y_i)}. \tag{71}
\]

With this setup, we can now state our simplest consistency result, which ensures that the
algorithm converges to a weighted combination of classifiers in the space $\mathcal{H}$ that achieves
the minimum exponential loss over the training set at an efficient rate.

Lemma 26 The $\widehat{\mathrm{risk}}$ of the predictions $F_T$, as defined in (69), converges to that of the
optimal predictions of any combination of the weak classifiers in $\mathcal{H}$ at the rate $O(1/T)$:

\[
\widehat{\mathrm{risk}}(F_T) - \inf_{\alpha : \mathcal{H} \to \mathbb{R}} \widehat{\mathrm{risk}}(F_\alpha) \le \frac{C}{T}, \tag{72}
\]

where $C$ is a constant depending only on the dataset.

A slightly stronger result would state that the average exponential loss when measured
with respect to the test set, and not just the empirical set, also converges. The test set is
generated by some target distribution $D$ over example label pairs, and we introduce the
$\mathrm{risk}_D$ operator to measure the exponential loss for any function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ with respect
to $D$:

\[
\mathrm{risk}_D(F) = \mathbb{E}_{(x,y) \sim D}\left[\sum_{l \ne y} e^{F(x,l) - F(x,y)}\right].
\]

We show this stronger result holds if the function $F_T$ is modified to the function $\bar{F}_T :$
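$\mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that takes values in the range $[-C, 0]$, for some large constant $C$:

\[
\bar{F}_T(x,l) = \max\left\{-C,\; F_T(x,l) - \max_{l'} F_T(x,l')\right\}. \tag{73}
\]

Lemma 27 If $\bar{F}_T$ is as in (73), and the number of rounds $T$ is set to $T_m = \sqrt{m}$, then its
$\mathrm{risk}_D$ converges to the optimal value as $m \to \infty$ with high probability:

\[
\Pr\left[\mathrm{risk}_D\left(\bar{F}_{T_m}\right) \le \inf_{F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}} \mathrm{risk}_D(F) + O\left(m^{-c}\right)\right] \ge 1 - \frac{1}{m^2}, \tag{74}
\]

where $c > 0$ is some absolute constant, and the probability is over the draw of training
examples.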
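The truncation in (73) and the empirical exponential risk are simple to compute. The following sketch is illustrative only: the function names `truncate`, `exp_risk`, `argmax` and the score values are hypothetical, labels are 0-indexed, and each row holds the scores $F_T(x, \cdot)$ of one example.

```python
import math

def truncate(scores, C):
    """Truncated scores of (73): shift so the top score is 0, then clip at -C."""
    top = max(scores)
    return [max(-C, s - top) for s in scores]

def exp_risk(score_rows, labels):
    """Average over examples of the sum over l != y of exp(F(x,l) - F(x,y))."""
    total = 0.0
    for scores, yv in zip(score_rows, labels):
        total += sum(math.exp(s - scores[yv])
                     for l, s in enumerate(scores) if l != yv)
    return total / len(labels)

def argmax(scores):
    """Index of the largest score, ties broken toward the lowest index."""
    return max(range(len(scores)), key=lambda l: scores[l])

rows = [[3.0, 0.5, -40.0], [0.2, 0.1, 7.0], [1.0, 1.5, 0.0]]   # hypothetical F_T values
labels = [0, 2, 0]
C = 10.0
trunc_rows = [truncate(r, C) for r in rows]
```

Truncation leaves the predicted label unchanged, since the top score is mapped to 0 and every other score to a non-positive value, while every term of the loss is capped at $e^{C}$. This boundedness is what makes the truncated function convenient for generalization arguments.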

We prove Lemmas 26 and 27 by demonstrating a strong correspondence between
AdaBoost.MM and binary AdaBoost, and then leveraging almost identical known consistency
results for AdaBoost (Bartlett and Traskin, 2007). Our proofs will closely follow the
exposition in Chapter 12 of (Schapire and Freund, 2012) on the consistency of AdaBoost, and
are deferred to the appendix.

So far we have focused on $\mathrm{risk}_D$, but a more desirable consistency result would state
that the test error of the final classifier output by AdaBoost.MM converges to the Bayes
optimal error. The test error is measured by the $\mathrm{err}_D$ operator, and is given by

\[
\mathrm{err}_D(H) = \Pr_{(x,y) \sim D}\left[H(x) \ne y\right]. \tag{75}
\]

The Bayes optimal classifier $H_{\mathrm{opt}}$ is a classifier achieving the minimum error among all
possible classifying functions

\[
\mathrm{err}_D(H_{\mathrm{opt}}) = \inf_{H : \mathcal{X} \to \mathcal{Y}} \mathrm{err}_D(H), \tag{76}
\]

and we want our algorithm to output a classifier whose $\mathrm{err}_D$ approaches $\mathrm{err}_D(H_{\mathrm{opt}})$. In
designing the algorithm, our main focus was on reducing the exponential loss, captured by
$\mathrm{risk}_D$ and $\widehat{\mathrm{risk}}$. Unless these loss functions are aligned properly with classification error, we
cannot hope to achieve optimal error. The next result shows that our loss functions are
correctly aligned, or more technically Bayes consistent. In other words, if a scoring function
$F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is close to achieving optimal $\mathrm{risk}_D$, then the classifier $H : \mathcal{X} \to \mathcal{Y}$ derived
from it as follows:

\[
H(x) \in \operatorname*{argmax}_{y \in \mathcal{Y}} F(x,y), \tag{77}
\]

also approaches the Bayes optimal error.

Lemma 28 Suppose $F$ is a scoring function achieving close to optimal risk

\[
\mathrm{risk}_D(F) \le \inf_{F' : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}} \mathrm{risk}_D(F') + \varepsilon, \tag{78}
\]

for some $\varepsilon \ge 0$. If $H$ is the classifier derived from it as in (77), then it achieves close to
the Bayes optimal error

\[
\mathrm{err}_D(H) \le \mathrm{err}_D(H_{\mathrm{opt}}) + \sqrt{2\varepsilon}. \tag{79}
\]

Proof The proof is similar to that of Theorem 12.1 in (Schapire and Freund, 2012), which
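in turn is based on the work by Zhang (2004) and Bartlett et al. (2006). Let $p(x) =
\Pr_{(x',y') \sim D}(x' = x)$ denote the marginalized probability of drawing example $x$ from $D$,
and let $p^x_y = \Pr_{(x',y') \sim D}[y' = y \mid x' = x]$ denote the conditional probability of drawing label
$y$ given we have drawn example $x$. We first rewrite the difference in errors between $H$ and
$H_{\mathrm{opt}}$ using these probabilities. Firstly note that the accuracy of any classifier $H'$ is given
by

\[
\sum_{x \in \mathcal{X}} D(x, H'(x)) = \sum_{x \in \mathcal{X}} p(x)\, p^x_{H'(x)}.
\]

If $\mathcal{X}'$ is the set of examples where the predictions of $H$ and $H_{\mathrm{opt}}$ differ, $\mathcal{X}' = \{x \in \mathcal{X} : H(x) \ne H_{\mathrm{opt}}(x)\}$,
then we may bound the error differences as

\[
\mathrm{err}_D(H) - \mathrm{err}_D(H_{\mathrm{opt}}) = \sum_{x \in \mathcal{X}'} p(x) \left(p^x_{H_{\mathrm{opt}}(x)} - p^x_{H(x)}\right). \tag{80}
\]

We next relate this expression to the difference of the losses.

Notice that for any scoring function $F'$, the $\mathrm{risk}_D$ can be rewritten as follows:

\[
\mathrm{risk}_D(F') = \sum_{x \in \mathcal{X}} p(x) \sum_{l < l'} \left\{p^x_l\, e^{F'(x,l') - F'(x,l)} + p^x_{l'}\, e^{F'(x,l) - F'(x,l')}\right\}.
\]

Denote the inner summation in curly brackets by $L^{l,l'}_{F'}(x)$, and notice this quantity is
minimized if

\[
e^{F'(x,l) - F'(x,l')} = \sqrt{p^x_l / p^x_{l'}}, \quad \text{i.e., if } F'(x,l) - F'(x,l') = \tfrac{1}{2} \ln p^x_l - \tfrac{1}{2} \ln p^x_{l'}.
\]

Therefore, defining $F^*(x,l) = \tfrac{1}{2} \ln p^x_l$ leads to a $\mathrm{risk}_D$ minimizing function $F^*$. Furthermore,
for any example and pair of labels $l, l'$, the quantity $L^{l,l'}_{F^*}(x)$ is at most $L^{l,l'}_{F}(x)$, and therefore
the difference in losses of $F^*$ and $F$ may be lower bounded as follows:

\begin{align*}
\varepsilon \ge \mathrm{risk}_D(F) - \mathrm{risk}_D(F^*) &= \sum_{x \in \mathcal{X}} p(x) \sum_{l < l'} \left(L^{l,l'}_{F} - L^{l,l'}_{F^*}\right) \\
&\ge \sum_{x \in \mathcal{X}'} p(x) \left\{L^{H(x),H_{\mathrm{opt}}(x)}_{F} - L^{H(x),H_{\mathrm{opt}}(x)}_{F^*}\right\}. \tag{81}
\end{align*}

We next study the term in the curly brackets for a fixed $x$. Let $A$ and $B$ denote $H(x)$ and
$H_{\mathrm{opt}}(x)$, respectively. We have already seen that $L^{A,B}_{F^*} = 2\sqrt{p^x_A p^x_B}$. Further, by definition
of Bayes optimality, $p^x_B \ge p^x_A$. On the other hand, since $x \in \mathcal{X}'$, we know that $B \ne A$, and
hence, $F(x,A) \ge F(x,B)$. Let $e^{F(x,A) - F(x,B)} = 1 + \eta$, for some $\eta \ge 0$. The quantity $L^{A,B}_{F}$
may be lower bounded as:

\begin{align*}
L^{A,B}_{F} &= p^x_A\, e^{F(x,B) - F(x,A)} + p^x_B\, e^{F(x,A) - F(x,B)} \\
&= (1 + \eta)\, p^x_B + (1 + \eta)^{-1} p^x_A \\
&\ge (1 + \eta)\, p^x_B + (1 - \eta)\, p^x_A \\
&= p^x_A + p^x_B + \eta\left(p^x_B - p^x_A\right) \ge p^x_A + p^x_B.
\end{align*}

Combining we get

\[
L^{A,B}_{F} - L^{A,B}_{F^*} \ge p^x_A + p^x_B - 2\sqrt{p^x_A p^x_B} = \left(\sqrt{p^x_B} - \sqrt{p^x_A}\right)^2.
\]
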
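Plugging back into (81) we get

\[
\sum_{x \in \mathcal{X}'} p(x) \left(\sqrt{p^x_{H_{\mathrm{opt}}(x)}} - \sqrt{p^x_{H(x)}}\right)^2 \le \varepsilon. \tag{82}
\]

Now we connect (80) to the previous expression as follows

\begin{align*}
\left\{\mathrm{err}_D(H) - \mathrm{err}_D(H_{\mathrm{opt}})\right\}^2
&= \left(\sum_{x \in \mathcal{X}'} p(x) \left(p^x_{H_{\mathrm{opt}}(x)} - p^x_{H(x)}\right)\right)^2 \\
&\le \left(\sum_{x \in \mathcal{X}'} p(x)\right) \left(\sum_{x \in \mathcal{X}'} p(x) \left(p^x_{H_{\mathrm{opt}}(x)} - p^x_{H(x)}\right)^2\right) && \text{(Cauchy-Schwarz)} \\
&\le \sum_{x \in \mathcal{X}'} p(x) \left(p^x_{H_{\mathrm{opt}}(x)} - p^x_{H(x)}\right)^2 \tag{83} \\
&\le 2 \sum_{x \in \mathcal{X}'} p(x) \left(\sqrt{p^x_{H_{\mathrm{opt}}(x)}} - \sqrt{p^x_{H(x)}}\right)^2 \tag{84} \\
&\le 2\varepsilon, && \text{(by (82))}
\end{align*}

where (83) holds since

\[
\sum_{x \in \mathcal{X}'} p(x) = \Pr_{(x',y') \sim D}\left[x' \in \mathcal{X}'\right] \le 1,
\]

and (84) holds since

\[
p^x_{H_{\mathrm{opt}}(x)} + p^x_{H(x)} \le 1 \implies \sqrt{p^x_{H_{\mathrm{opt}}(x)}} + \sqrt{p^x_{H(x)}} \le \sqrt{2}.
\]

Therefore, $\mathrm{err}_D(H) - \mathrm{err}_D(H_{\mathrm{opt}}) \le \sqrt{2\varepsilon}$. ∎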
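The inequality of Lemma 28 can be observed on a small example. The sketch below is an illustration under simplifying assumptions: the distribution $D$ is concentrated on a single example $x$, so that $p(x) = 1$ and only the conditional label probabilities matter. The names `risk`, `error`, the probability vector and the candidate scoring vectors are hypothetical.

```python
import math

def risk(p, F):
    """Exponential risk at a single example with label probabilities p."""
    k = len(p)
    return sum(p[y] * math.exp(F[l] - F[y])
               for y in range(k) for l in range(k) if l != y)

def error(p, F):
    """Error of the classifier predicting an argmax of F (ties to lowest index)."""
    h = max(range(len(F)), key=lambda l: F[l])
    return 1.0 - p[h]

p = [0.5, 0.3, 0.2]                         # hypothetical conditional distribution
F_star = [0.5 * math.log(q) for q in p]     # risk minimizer from the proof
bayes_err = 1.0 - max(p)

candidates = [[0.0, 0.0, 0.0], [0.0, 0.4, -1.0], [-0.2, 0.1, 0.0], [1.0, -1.0, 2.0]]
checks = []
for F in candidates:
    eps = risk(p, F) - risk(p, F_star)      # excess risk of F
    gap = error(p, F) - bayes_err           # excess error of the derived classifier
    checks.append((eps, gap))
```

For every candidate the excess risk is non-negative and the excess error is at most $\sqrt{2\varepsilon}$, as the lemma requires. The minimum risk itself equals $2\sum_{l<l'}\sqrt{p^x_l p^x_{l'}}$, in agreement with the value of $L^{l,l'}_{F^*}$ used in the proof.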

Note that the classifier $\bar{H}_T$, derived from the truncated scoring function $\bar{F}_T$ in the manner
provided in (77), makes identical predictions to, and hence has the same $\mathrm{err}_D$ as, the
classifier $H_T$ output by the adaptive algorithm. Further, Lemma 27 seems to suggest that
$\bar{F}_T$ satisfies the condition in (78), which, combined with our previous observation $\mathrm{err}_D(\bar{H}_T) =
\mathrm{err}_D(H_T)$, would imply $H_T$ approaches the optimal error. However, the condition (78)
requires achieving optimal risk over all scoring functions, and not just ones achievable as a
combination of weak classifiers in $\mathcal{H}$. Therefore, in order to use Lemma 28, we require the
weak classifier space to be sufficiently rich, so that some combination of the weak classifiers
in $\mathcal{H}$ attains $\mathrm{risk}_D$ arbitrarily close to the minimum attainable by any function:

\[
\inf_{\alpha : \mathcal{H} \to \mathbb{R}} \mathrm{risk}_D(F_\alpha) = \inf_{F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}} \mathrm{risk}_D(F). \tag{85}
\]

The richness condition, along with our previous arguments and Lemma 27, immediately
imply the following result.

Theorem 29 If the weak classifier space $\mathcal{H}$ satisfies the richness condition (85), and the
number of rounds $T$ is set to $\sqrt{m}$, then the error of the final classifier approaches the
Bayes optimal error:

\[
\Pr\left[\mathrm{err}_D\left(H_{\sqrt{m}}\right) \le \mathrm{err}_D(H_{\mathrm{opt}}) + O\left(m^{-c}\right)\right] \ge 1 - \frac{1}{m^2}, \tag{86}
\]

where $c > 0$ is some positive constant, and the probability is over the draw of training
examples.

A consequence of the theorem is our strongest consistency result:

Corollary 30 Let $H_{\mathrm{opt}}$ be the Bayes optimal classifier, and let the weak classifier space $\mathcal{H}$
satisfy the richness condition (85). Suppose $m$ example and label pairs $\{(x_1, y_1), \ldots, (x_m, y_m)\}$
are sampled from the distribution $D$, the number of rounds $T$ is set to be $\sqrt{m}$, and these
are supplied to AdaBoost.MM. Then, in the limit $m \to \infty$, the final classifier $H_{\sqrt{m}}$ output
by AdaBoost.MM achieves the Bayes optimal error almost surely:

\[
\Pr\left[\lim_{m \to \infty} \mathrm{err}_D\left(H_{\sqrt{m}}\right) = \mathrm{err}_D(H_{\mathrm{opt}})\right] = 1, \tag{87}
\]

where the probability is over the randomness due to the draw of training examples.

The proof of Corollary 30, based on the Borel-Cantelli Lemma, is very similar to that
of Corollary 12.3 in (Schapire and Freund, 2012), and so we omit it. When $k = 2$,
AdaBoost.MM is identical to AdaBoost. For Theorem 29 to hold for AdaBoost, the richness
assumption (85) is necessary, since there are examples due to Long and Servedio (2010)
showing that the theorem may not hold when that assumption is violated.

Although we have seen that technically AdaBoost.MM is consistent under broad
assumptions, intuitively perhaps it is not clear what properties were responsible for this desirable
behavior. We next briefly study the high level ingredients necessary for consistency in
boosting algorithms.

Key ingredients for consistency. We show here how both the choice of the loss function
as well as the weak learning condition play crucial roles in ensuring consistency. If the loss
function were not Bayes consistent as in Lemma 28, driving it down arbitrarily could still
lead to high test error. For example, the loss employed by SAMME (Zhu et al., 2009) does
not upper bound the error, and therefore, although SAMME can manage to drive down its loss
arbitrarily when supplied with the dataset discussed in Figure 1, its error remains
high.

Equally important is the weak learning condition. Even if the loss function is chosen to
be error, so that it is trivially Bayes consistent, choosing the wrong weak learning condition
could lead to inconsistency. In particular, if the weak learning condition is stronger than
necessary, then, even on a boostable dataset where the error can be driven to zero, the
boosting algorithm may get stuck prematurely because its stronger than necessary demands
cannot be met by the weak classifier space. We have already seen theoretical examples of
such datasets, and we will see some practical instances of this phenomenon in the next
section.

On the other hand, if the weak learning condition is too weak, then a lazy Weak Learner
may satisfy the Booster's demands by returning weak classifiers belonging only to a
non-boostable subset of the available weak classifier space. For instance, consider again the
dataset in Figure 1, and assume that this time the weak classifier space is much richer, and
consists of all possible classifying functions. However, in any round, Weak Learner searches
through the space, first trying hypotheses $h_1$ and $h_2$ shown in the figure, and only if neither
satisfies the Booster, searches for additional weak classifiers. In that case, any algorithm using
SAMME's weak learning condition, which is known to be too weak and satisfiable by just
the two hypotheses $\{h_1, h_2\}$, would only receive $h_1$ or $h_2$ in each round, and therefore be
unable to reach the optimum accuracy. Of course, if the Weak Learner is extremely generous
and helpful, then it may return the right collection of weak classifiers even with a null weak
learning condition that places no demands on it. However, in practice, many Weak Learners
used are similar to the lazy weak learner described since these are computationally efficient.

To see the effect of inconsistency arising from too weak learning conditions in practice,
we need to test boosting algorithms relying on such conditions on significantly hard datasets,
where only the strictest Booster strategy can extract the necessary service from Weak
Learner for creating an optimal classifier. We did not include such experiments, and it will
be an interesting empirical conjecture to be tested in the future. However, we did include
experiments that illustrate the consequence of using too strong conditions, and we discuss
those in the next section.

10. Experiments 

In the final section of this paper, we report preliminary experimental results on 13 UCI
datasets: letter, nursery, pendigits, satimage, segmentation, vowel, car, chess, connect4, forest,
magic04, poker, abalone. These datasets are all multiclass except for magic04, have a wide
range of sizes, contain all combinations of real and categorical features, have different
ratios of number of examples to number of features per example, and are drawn from a variety of
real-life situations. Most sets come with prespecified train and test splits which we use; if
not, we picked a random 4 : 1 split. Throughout this section by MM we refer to the version
of AdaBoost.MM studied in the consistency section, which uses the approximate step size
(67).

There were two kinds of experiments. In the first, we took a standard implementation
M1 of AdaBoost.M1 with C4.5 as weak learner, and the Boostexter implementation MH
of AdaBoost.MH using stumps (Schapire and Singer, 2000), and compared them against our
method MM with a naive greedy tree-searching weak-learner Greedy. The size of trees to be
used can be specified to our weak learner, and was chosen to be of the same order as
the tree sizes used by M1. The test-error after 500 rounds of boosting for each algorithm
and dataset is bar-plotted in Figure 7. The performance is comparable with M1 and far
better than MH (understandably since stumps are far weaker than trees), even though our
weak-learner is very naive. The convergence rates of error with rounds of M1 and MM are
also comparable, as shown in Figure 8 (we omitted the curve for MH since it lay far above
both M1 and MM).

Figure 7: This is a plot of the final test-errors of standard implementations of M1, MH and MM after
500 rounds of boosting on different datasets. Both M1 and MM achieve comparable error, which is
often lower than that achieved by MH. This is because M1 and MM used trees of comparable sizes
which were often much larger and more powerful than the decision stumps that MH boosted.

Figure 8: Plots of the rates at which M1 (black, dashed) and MM (red, solid) drive down test-error on
different data-sets when using trees of comparable sizes as weak classifiers. M1 called C4.5, and MM
called Greedy, respectively, as weak-learner. The tree sizes returned by C4.5 were used as a bound
on the size of the trees that Greedy was allowed to return. This bound on the tree-size depended
on the dataset, and is shown next to the dataset labels.

Figure 9: For this figure, M1 (black, dashed), MH (blue, dotted) and MM (red, solid) were designed to
boost decision trees of restricted sizes. The final test-errors of the three algorithms after 500 rounds
of boosting are plotted against the maximum tree-sizes allowed for the weak classifiers. MM achieves
much lower error when the weak classifiers are very weak, that is, with smaller trees.

We next investigated how each algorithm performs with less powerful weak-learners. We
modified MH so that it uses a tree returning a single multiclass prediction on each example.
For MH and MM we used the Greedy weak learner, while for M1 we used a more powerful
variant Greedy-Info whose greedy criterion was information gain rather than error (we
also ran M1 on top of Greedy but Greedy-Info consistently gave better results so we only
report the latter). We tried all tree-sizes in the set {10, 20, 50, 100, 200, 500, 1000, 2000,
4000} up to the tree-size used by M1 on C4.5 for each data-set. We plotted the error of each
algorithm against tree size for each data-set in Figure 9. As predicted by our theory, our
algorithm succeeds in boosting the accuracy even when the tree size is too small to meet
the stronger weak learning assumptions of the other algorithms. More insight is provided
by plots in Figure 10 of the rate of convergence of error with rounds when the tree size
allowed is very small (5). Both M1 and MH drive down the error for a few rounds. But since
boosting keeps creating harder distributions, very soon the small-tree learning algorithms
Greedy and Greedy-Info are no longer able to meet the excessive requirements of M1 and
MH respectively. However, our algorithm makes more reasonable demands that are easily
met by Greedy.






I. MUKHERJEE AND R. E. SCHAPIRE 



Figure 10: A plot of how fast the test-errors of the three algorithms drop with rounds when the
weak classifiers are trees with a size of at most 5. Algorithms M1 and MH make strong demands
which cannot be met by the extremely weak classifiers after a few rounds, whereas MM makes gentler
demands, and is hence able to drive down error through all the rounds of boosting.

11. Conclusion

In summary, we create a new framework for studying multiclass boosting. This framework 
is very general and captures the weak learning conditions implicitly used by many earlier 
multiclass boosting algorithms as well as novel conditions, including the minimal condition 
under which boosting is possible. We also show how to design boosting algorithms relying on 
these weak learning conditions that drive down training error rapidly. These algorithms are 
the optimal strategies for playing certain two player games. Based on this game-theoretic 
approach, we also design a multiclass boosting algorithm that is consistent, i.e., approaches 
the minimum empirical risk, and under some basic assumptions, the Bayes optimal test 
error. Preliminary experiments show that this algorithm can achieve much lower error 
compared to existing algorithms when used with very weak classifiers. 

Although we can efficiently compute the game-theoretically optimal strategies under
most conditions, when using the minimal weak learning condition, and non-convex 0-1
error as loss function, we require exponential computational time to solve the corresponding
boosting games. Boosting algorithms based on error are potentially far more noise tolerant
than those based on convex loss functions, and finding efficiently computable near-optimal
strategies in this situation is an important problem left for future work. Further, we
primarily work with weak classifiers that output a single multiclass prediction per example,
whereas weak hypotheses that make multilabel multiclass predictions are typically more
powerful. We believe that multilabel predictions do not increase the power of the weak
learner in our framework, and our theory can be extended without much work to include
such hypotheses, but we do not address this here. Finally, it will be interesting to see if
the notion of minimal weak learning condition can be extended to boosting settings beyond
classification, such as ranking.

Acknowledgments 

This research was funded by the National Science Foundation under grants IIS-0325500 and 
IIS-1016029. 

Appendix 

Optimality of the OS strategy 

Here we prove Theorem 9. The proof of the upper bound on the loss is very similar to the 
proof of Theorem 2 in (Schapire, 2001). For the lower bound, a similar result is proven in 
Theorem 3 in (Schapire, 2001). However, the proof relies on certain assumptions that may 
not hold in our setting, and we instead follow the more direct lower bounding techniques 
in Section 5 of (Mukherjee and Schapire, 2010). 

We first show that the average potential of states does not increase in any round. The 
dual form of the recurrence (24) and the choice of the cost matrix in (25) together ensure 
that for each example i, 

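$$\begin{aligned}
\phi^{B(i)}_{T-t}(s_t(i)) &= \max_{l=1}^{k}\Bigl\{\phi^{B(i)}_{T-t-1}(s_t(i) + e_l) - \bigl(C_t(i,l) - \langle C_t(i), B(i)\rangle\bigr)\Bigr\} \\
&\ge \phi^{B(i)}_{T-t-1}\bigl(s_t(i) + e_{h_t(x_i)}\bigr) - \bigl(C_t(i, h_t(x_i)) - \langle C_t(i), B(i)\rangle\bigr).
\end{aligned}$$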




I. MUKHERJEE AND R. E. SCHAPIRE 



Summing up the inequalities over all examples, we get 

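$$\sum_{i=1}^{m} \phi^{B(i)}_{T-t-1}(s_{t+1}(i)) \le \sum_{i=1}^{m} \phi^{B(i)}_{T-t}(s_t(i)) + \sum_{i=1}^{m} \bigl\{C_t(i, h_t(x_i)) - \langle C_t(i), B(i)\rangle\bigr\}.$$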



The first two summations are the total potentials in rounds $t+1$ and $t$, respectively, and the 
third summation is the difference in the costs incurred by the weak classifier $h_t$ returned 
in iteration $t$ and the baseline $B$. By the weak learning condition, this difference is 
non-positive, implying that the average potential does not increase. 

Next we show that the bound is tight. In particular, choose any accuracy parameter 
$\epsilon > 0$ and total number of iterations $T$, and let $m$ be as large as in (28). We show that in 
any iteration $t \le T$, based on Booster's choice of cost-matrix $C$, an adversary can choose a 
weak classifier $h_t \in \mathcal{H}^{\text{all}}$ such that the weak learning condition is satisfied, and the average 
potential does not fall by more than an amount $\epsilon/T$. In fact, we show how to choose labels 
$l_1, \ldots, l_m$ such that the following hold simultaneously: 

$$\sum_{i=1}^{m} C(i, l_i) \le \sum_{i=1}^{m} \langle C(i), B(i)\rangle \tag{88}$$

$$\sum_{i=1}^{m} \phi^{B(i)}_{T-t}(s_t(i)) \le \frac{m\epsilon}{T} + \sum_{i=1}^{m} \phi^{B(i)}_{T-t-1}(s_t(i) + e_{l_i}) \tag{89}$$

Since the total potential then falls by at most $m\epsilon/T$ in each of the $T$ rounds, this will 
imply that the final potential or loss is at most $\epsilon$ less than the bound in (26). 

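We first construct, for each example $i$, a distribution $p_i \in \Delta\{1, \ldots, k\}$ such that the 
size of the support of $p_i$ is either 1 or 2, and 

$$\phi^{B(i)}_{T-t}(s_t(i)) = E_{l \sim p_i}\bigl[\phi^{B(i)}_{T-t-1}(s_t(i) + e_l)\bigr]. \tag{90}$$

To satisfy (90), by (20), we may choose $p_i$ as any optimal response of the max player in the 
minmax recurrence when the min player chooses $C(i)$: 

$$p_i \in \arg\max_{p \in \mathcal{P}_i} E_{l \sim p}\bigl[\phi^{B(i)}_{T-t-1}(s + e_l)\bigr] \tag{91}$$

where $s = s_t(i)$ and 

$$\mathcal{P}_i = \bigl\{p \in \Delta\{1, \ldots, k\} : E_{l \sim p}[C(i, l)] \le \langle C(i), B(i)\rangle\bigr\}. \tag{92}$$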

The existence of $p_i$ is guaranteed, since, by Lemma 7, the polytope $\mathcal{P}_i$ is non-empty for 
each $i$. The next result shows that we may choose $p_i$ to have a support of size 1 or 2. 

Lemma 31 There is a $p_i$ satisfying (91) with either 1 or 2 non-zero coordinates. 

Proof Let $p^*$ satisfy (91), and let its support set be $S$. Let $\mu_i$ denote the mean cost under 
this distribution: 

$$\mu_i = E_{l \sim p^*}[C(i, l)] \le \langle C(i), B(i)\rangle.$$

If the support has size at most 2, then we are done. Further, if each non-zero coordinate 
$l \in S$ of $p^*$ satisfies $C(i, l) = \mu_i$, then the distribution $p_i$ that concentrates all its weight on 
the label $l^{\min} \in S$ minimizing $\phi^{B(i)}_{T-t-1}(s + e_{l^{\min}})$ is an optimum solution with support of size 
1. Otherwise, we can pick labels $l_1^{\min}, l_2^{\min} \in S$ such that 

$$C(i, l_1^{\min}) < \mu_i < C(i, l_2^{\min}).$$

Then we may choose a distribution $q$ supported on these two labels with mean $\mu_i$: 

$$E_{l \sim q}[C(i, l)] = q(l_1^{\min})\, C(i, l_1^{\min}) + q(l_2^{\min})\, C(i, l_2^{\min}) = \mu_i.$$

Choose $\lambda$ as follows: 

$$\lambda = \min\left\{\frac{p^*(l_1^{\min})}{q(l_1^{\min})}, \frac{p^*(l_2^{\min})}{q(l_2^{\min})}\right\},$$

and write $p^* = \lambda q + (1 - \lambda) p$. Then both $p, q$ belong to the polytope $\mathcal{P}_i$, and have strictly 
fewer non-zero coordinates than $p^*$. Further, by linearity, one of $q, p$ is also optimal. We 
repeat the process on the new optimal distribution till we find one which has only 1 or 2 
non-zero entries. ■ 
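
The conclusion of Lemma 31 can be illustrated numerically. The sketch below is not the iterative support-reduction argument of the proof; it is a brute-force search, under the assumption that the potentials and costs for one example are given as plain lists (`phi`, `cost`) and that `budget` stands for $\langle C(i), B(i)\rangle$. All names are hypothetical. It enumerates only point masses and two-label mixtures whose mean cost meets the budget exactly, which are the vertices of a polytope of the form (92).

```python
from itertools import combinations


def best_small_support(phi, cost, budget):
    """Maximize sum_l p[l] * phi[l] subject to sum_l p[l] * cost[l] <= budget,
    searching only distributions with one or two non-zero coordinates.

    Returns (value, p), or (None, None) if no label fits the budget.
    """
    k = len(phi)
    best_value, best_p = None, None

    def consider(p):
        nonlocal best_value, best_p
        value = sum(p[l] * phi[l] for l in range(k))
        if best_value is None or value > best_value:
            best_value, best_p = value, p

    # Point masses on labels whose cost fits within the budget.
    for l in range(k):
        if cost[l] <= budget:
            p = [0.0] * k
            p[l] = 1.0
            consider(p)

    # Two-label mixtures whose mean cost equals the budget exactly.
    for l1, l2 in combinations(range(k), 2):
        lo, hi = (l1, l2) if cost[l1] < cost[l2] else (l2, l1)
        if cost[lo] < budget < cost[hi]:
            # Weight on the cheaper label that makes the mean cost equal budget.
            w = (cost[hi] - budget) / (cost[hi] - cost[lo])
            p = [0.0] * k
            p[lo], p[hi] = w, 1.0 - w
            consider(p)

    return best_value, best_p


if __name__ == "__main__":
    value, p = best_small_support([1.0, 4.0, 2.0], [0.0, 2.0, 1.0], 1.0)
    print(value, p)
```

In the toy instance, the best point mass reaches value 2.0, while mixing the cheapest and the most expensive label reaches 2.5, so the optimum has support of size 2.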

We next show how to choose the labels $l_1, \ldots, l_m$ using the distributions $p_i$. For each $i$, 
let $\{l_i^+, l_i^-\}$ be the support of $p_i$ so that 

$$C(i, l_i^+) \le E_{l \sim p_i}[C(i, l)] \le C(i, l_i^-).$$

(When $p_i$ has only one non-zero element, then $l_i^+ = l_i^-$.) For brevity, we use $p_i^+$ and $p_i^-$ 
to denote $p_i(l_i^+)$ and $p_i(l_i^-)$, respectively. If the costs of both labels are equal, we assume 
without loss of generality that $p_i$ is concentrated on label $l_i^-$: 

$$C(i, l_i^-) - C(i, l_i^+) = 0 \implies p_i^+ = 0,\ p_i^- = 1. \tag{93}$$

We will choose each label $l_i$ from the set $\{l_i^-, l_i^+\}$. In fact, we will choose a partition 
$S_+, S_-$ of the examples $1, \ldots, m$ and choose the label depending on which side $S_\xi$, for 
$\xi \in \{-, +\}$, of the partition element $i$ belongs to: 

$$l_i = l_i^{\xi} \quad \text{if } i \in S_\xi.$$

In order to guide our choice for the partition, we introduce parameters $a_i, b_i$ as follows: 

$$a_i = C(i, l_i^-) - C(i, l_i^+),$$

$$b_i = \phi^{B(i)}_{T-t-1}\bigl(s_t(i) + e_{l_i^-}\bigr) - \phi^{B(i)}_{T-t-1}\bigl(s_t(i) + e_{l_i^+}\bigr).$$

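Notice that for each example $i$ and each sign-bit $\xi \in \{-1, +1\}$, we have the following 
relations: 

$$C(i, l_i^{\xi}) = E_{l \sim p_i}[C(i, l)] - \xi (1 - p_i^{\xi})\, a_i \tag{94}$$

$$\phi^{B(i)}_{T-t-1}\bigl(s_t(i) + e_{l_i^{\xi}}\bigr) = E_{l \sim p_i}\bigl[\phi^{B(i)}_{T-t-1}(s_t(i) + e_l)\bigr] - \xi (1 - p_i^{\xi})\, b_i. \tag{95}$$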
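
For completeness, relation (94) can be checked directly from the two-point form of $p_i$. The short derivation below is a sketch that uses only the definition of $a_i$ and the fact that $p_i^+ + p_i^- = 1$ (which also holds in the degenerate case fixed by (93)).

```latex
\begin{align*}
E_{l \sim p_i}[C(i,l)]
  &= p_i^+\, C(i, l_i^+) + p_i^-\, C(i, l_i^-) \\
  &= C(i, l_i^+) + p_i^- \bigl(C(i, l_i^-) - C(i, l_i^+)\bigr)
   = C(i, l_i^+) + (1 - p_i^+)\, a_i, \\
E_{l \sim p_i}[C(i,l)]
  &= C(i, l_i^-) - p_i^+ \bigl(C(i, l_i^-) - C(i, l_i^+)\bigr)
   = C(i, l_i^-) - (1 - p_i^-)\, a_i .
\end{align*}
```

Rearranging the two lines gives (94) for $\xi = +1$ and $\xi = -1$, respectively. The same computation with the potential in place of the cost and $b_i$ in place of $a_i$ gives (95).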



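Then the cost incurred by the choice of labels can be expressed in terms of these parameters 
as follows: 

$$\begin{aligned}
\sum_{i \in S_+} C(i, l_i^+) + \sum_{i \in S_-} C(i, l_i^-)
&= \sum_{i \in S_+} \bigl\{E_{l \sim p_i}[C(i, l)] - a_i + p_i^+ a_i\bigr\} + \sum_{i \in S_-} \bigl\{E_{l \sim p_i}[C(i, l)] + p_i^+ a_i\bigr\} \\
&= \sum_{i=1}^{m} E_{l \sim p_i}[C(i, l)] + \Bigl(\sum_{i=1}^{m} p_i^+ a_i - \sum_{i \in S_+} a_i\Bigr) \\
&\le \sum_{i=1}^{m} \langle C(i), B(i)\rangle + \Bigl(\sum_{i=1}^{m} p_i^+ a_i - \sum_{i \in S_+} a_i\Bigr),
\end{aligned} \tag{96}$$

where the first equality follows from (94), and the inequality follows from the constraint on 
$p_i$ in (92). Similarly, the potential of the new states is given by 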



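$$\begin{aligned}
&\sum_{i \in S_+} \phi^{B(i)}_{T-t-1}\bigl(s_t(i) + e_{l_i^+}\bigr) + \sum_{i \in S_-} \phi^{B(i)}_{T-t-1}\bigl(s_t(i) + e_{l_i^-}\bigr) \\
&\quad= \sum_{i \in S_+} \Bigl\{E_{l \sim p_i}\bigl[\phi^{B(i)}_{T-t-1}(s_t(i) + e_l)\bigr] - b_i + p_i^+ b_i\Bigr\} + \sum_{i \in S_-} \Bigl\{E_{l \sim p_i}\bigl[\phi^{B(i)}_{T-t-1}(s_t(i) + e_l)\bigr] + p_i^+ b_i\Bigr\} \\
&\quad= \sum_{i=1}^{m} E_{l \sim p_i}\bigl[\phi^{B(i)}_{T-t-1}(s_t(i) + e_l)\bigr] + \Bigl(\sum_{i=1}^{m} p_i^+ b_i - \sum_{i \in S_+} b_i\Bigr) \qquad (97) \\
&\quad= \sum_{i=1}^{m} \phi^{B(i)}_{T-t}(s_t(i)) + \Bigl(\sum_{i=1}^{m} p_i^+ b_i - \sum_{i \in S_+} b_i\Bigr), \qquad (98)
\end{aligned}$$

where the first equality follows from (95), and the last equality from an optimal choice of $p_i$ 
satisfying (90). Now, (96) and (98) imply that in order to satisfy (88) and (89), it suffices 
to choose a subset $S_+$ satisfying 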



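$$\sum_{i \in S_+} a_i \ge \sum_{i=1}^{m} p_i^+ a_i, \qquad \sum_{i \in S_+} b_i \le \frac{m\epsilon}{T} + \sum_{i=1}^{m} p_i^+ b_i. \tag{99}$$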



We simplify the required conditions. Notice the first constraint tries to ensure that $S_+$ 
is big, while the second constraint forces it to be small, provided the $b_i$ are non-negative. 
However, if $b_i < 0$ for any example $i$, then adding this example to $S_+$ only helps both 
inequalities. In other words, if we can always construct a set $S_+$ satisfying (99) in the case 
where the $b_i$ are non-negative, then we may handle the more general situation by just adding 
the examples $i$ with negative $b_i$ to the set $S_+$ that would be constructed by considering only 
the examples $\{i : b_i \ge 0\}$. Therefore we may assume without loss of generality that the $b_i$ 
are non-negative. Further, assume (by relabeling if necessary) that $a_1, \ldots, a_{m'}$ are positive 
and $a_{m'+1} = \cdots = a_m = 0$, for some $m' \le m$. By (93), we have $p_i^+ = 0$ for $i > m'$. Therefore, 
by assigning the examples $m'+1, \ldots, m$ to the opposite partition $S_-$, we can ensure that 
(99) holds if the following is true: 

$$\sum_{i \in S_+} a_i \ge \sum_{i=1}^{m'} p_i^+ a_i \tag{100}$$

$$\sum_{i \in S_+} b_i \le \max_{i=1}^{m'} |b_i| + \sum_{i=1}^{m'} p_i^+ b_i, \tag{101}$$

where, for (101), we additionally used that, by the choice of $m$ (28) and the bound on loss 
variation (27), we have 

$$\frac{m\epsilon}{T} \ge \varkappa(L, T) \ge b_i \quad \text{for } i = 1, \ldots, m.$$

The next lemma shows how to construct such a subset $S_+$, and concludes our lower bound 
proof. 

Lemma 32 Suppose $a_1, \ldots, a_{m'}$ are positive and $b_1, \ldots, b_{m'}$ are non-negative reals, and 
$p_1^+, \ldots, p_{m'}^+ \in [0, 1]$ are probabilities. Then there exists a subset $S_+ \subseteq \{1, \ldots, m'\}$ such that 
(100) and (101) hold. 

Proof Assume, by relabeling if necessary, that the following ordering holds: 

$$\frac{a_1 - b_1}{a_1} \ge \cdots \ge \frac{a_{m'} - b_{m'}}{a_{m'}}. \tag{102}$$

Let $I \le m'$ be the largest integer such that 

$$a_1 + a_2 + \cdots + a_I < \sum_{i=1}^{m'} p_i^+ a_i. \tag{103}$$

Since the $p_i^+$ are at most 1, $I$ is in fact at most $m' - 1$. We will choose $S_+$ to be the first $I+1$ 
examples, $S_+ = \{1, \ldots, I+1\}$. Observe that (100) follows immediately from the definition 
of $I$. Further, (101) will hold if the following is true: 

$$b_1 + b_2 + \cdots + b_I \le \sum_{i=1}^{m'} p_i^+ b_i, \tag{104}$$

since the addition of one more example $I+1$ can exceed this bound by at most $b_{I+1} \le 
\max_{i=1}^{m'} |b_i|$. We prove (104) by showing that the left hand side of this equation is not much 
more than the left hand side of (103). We first rewrite the latter summation differently. 
The inequality in (103) implies we can pick $p'_1, \ldots, p'_{m'} \in [0, 1]$ (e.g., by simply scaling the 
$p_i^+$'s appropriately) such that 

$$a_1 + \cdots + a_I = \sum_{i=1}^{m'} p'_i a_i \tag{105}$$

$$\text{for } i = 1, \ldots, m': \quad p'_i \le p_i^+. \tag{106}$$

By subtracting off the first $I$ terms in the right hand side of (105) from both sides we get 

$$(1 - p'_1) a_1 + \cdots + (1 - p'_I) a_I = p'_{I+1} a_{I+1} + \cdots + p'_{m'} a_{m'}.$$

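Since the terms in the summations are non-negative, we may combine the above with the 
ordering property in (102) to get 

$$(1 - p'_1) a_1 \left(\frac{a_1 - b_1}{a_1}\right) + \cdots + (1 - p'_I) a_I \left(\frac{a_I - b_I}{a_I}\right) \ge p'_{I+1} a_{I+1} \left(\frac{a_{I+1} - b_{I+1}}{a_{I+1}}\right) + \cdots + p'_{m'} a_{m'} \left(\frac{a_{m'} - b_{m'}}{a_{m'}}\right). \tag{107}$$

Adding the expression 

$$p'_1 a_1 \left(\frac{a_1 - b_1}{a_1}\right) + \cdots + p'_I a_I \left(\frac{a_I - b_I}{a_I}\right)$$

to both sides of (107) yields 

$$\begin{aligned}
\sum_{i=1}^{I} a_i \left(\frac{a_i - b_i}{a_i}\right) &\ge \sum_{i=1}^{m'} p'_i a_i \left(\frac{a_i - b_i}{a_i}\right) \\
\text{i.e.}\quad \sum_{i=1}^{I} a_i - \sum_{i=1}^{I} b_i &\ge \sum_{i=1}^{m'} p'_i a_i - \sum_{i=1}^{m'} p'_i b_i \\
\text{i.e.}\quad \sum_{i=1}^{I} b_i &\le \sum_{i=1}^{m'} p'_i b_i,
\end{aligned} \tag{108}$$

where the last inequality follows from (105). Now (104) follows from (108) using (106) and 
the fact that the $b_i$'s are non-negative. ■ 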
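
The construction in the proof of Lemma 32 is a greedy prefix rule, and can be sketched in a few lines. The code below is only an illustration under stated assumptions: the inputs are plain Python lists with every `a[i] > 0`, every `b[i] >= 0` and every `p_plus[i]` in `[0, 1]`; the function names are hypothetical; and the degenerate case where $\sum_i p_i^+ a_i = 0$ (not discussed in the proof) is handled by returning the empty set, which satisfies (100) and (101) trivially.

```python
import random


def choose_s_plus(a, b, p_plus):
    """Return the indices of S_+ chosen by the greedy prefix rule."""
    m = len(a)
    # Ordering (102): decreasing values of (a_i - b_i) / a_i.
    order = sorted(range(m), key=lambda i: (a[i] - b[i]) / a[i], reverse=True)
    target = sum(p_plus[i] * a[i] for i in range(m))
    if target <= 0:
        return []  # assumption: empty S_+ in the degenerate case
    # Largest I with a_(1) + ... + a_(I) < target, as in (103).
    prefix, big_i = 0.0, 0
    for count, i in enumerate(order, start=1):
        prefix += a[i]
        if prefix < target:
            big_i = count
        else:
            break
    big_i = min(big_i, m - 1)  # guard against floating-point round-off
    return order[: big_i + 1]  # the first I + 1 examples


def satisfies_100_101(a, b, p_plus, s_plus, tol=1e-9):
    """Check inequalities (100) and (101) for a candidate subset."""
    sum_pa = sum(p * x for p, x in zip(p_plus, a))
    sum_pb = sum(p * x for p, x in zip(p_plus, b))
    max_b = max((abs(x) for x in b), default=0.0)
    ok_100 = sum(a[i] for i in s_plus) >= sum_pa - tol
    ok_101 = sum(b[i] for i in s_plus) <= max_b + sum_pb + tol
    return ok_100 and ok_101


if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.uniform(0.1, 5.0) for _ in range(6)]
    b = [rng.uniform(0.0, 5.0) for _ in range(6)]
    p = [rng.random() for _ in range(6)]
    s = choose_s_plus(a, b, p)
    print(s, satisfies_100_101(a, b, p, s))
```

Indices here are 0-based, whereas the proof numbers the examples from 1. Sorting costs $O(m' \log m')$, so the adversary's choice of labels is cheap to compute once the $a_i$, $b_i$ and $p_i^+$ are known.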

This completes the proof of the lower bound. 

Consistency proofs 

Here we sketch the proofs of Lemmas 26 and 27. Our approach will be to relate our algorithm 
to AdaBoost and then use relevant known results on the consistency of AdaBoost. We first 
describe the correspondence between the two algorithms, and then state and connect the 
relevant results on AdaBoost to the ones in this section. 

For any given multiclass dataset and weak classifier space, we will obtain a transformed 
binary dataset and weak classifier space, such that the run of AdaBoost.MM on the original 
dataset will be in perfect correspondence with the run of AdaBoost on the transformed 
dataset. In particular, the loss and error on both the training and test set of the combined 
classifiers produced by our algorithm will be exactly equal to those produced by AdaBoost, 
while the space of functions and classifiers on the two datasets will be in correspondence. 

Intuitively, we transform our multiclass classification problem into a single binary 
classification problem in a way similar to the all-pairs multiclass to binary reduction. A very 
similar reduction was carried out by Freund and Schapire (1997). Borrowing their 
terminology, the transformed dataset roughly consists of mislabel triples $(x, y, l)$ where $y$ is the 
true label of the example and $l$ is an incorrect label. The new binary label of a mislabel 
triple is always $-1$, signifying that $l$ is not the true label. A multiclass classifier becomes 
a binary classifier that predicts $\pm 1$ on the mislabel triple $(x, y, l)$ depending on whether the 
prediction on $x$ matches label $l$; therefore error on the transformed binary dataset is low 
whenever the multiclass accuracy is high. The details of the transformation are provided in 
Figure 11. 

Some of the properties between the functions and their transformed counterparts are 
described in the next lemma, showing that we are essentially dealing with similar objects. 

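Lemma 33 The following are identities for any scoring function $F : X \times Y \to \mathbb{R}$ and 
weight function $\alpha : \mathcal{H} \to \mathbb{R}$: 

$$\widehat{\mathrm{risk}}(F_\alpha) = \widehat{\mathrm{risk}}\bigl(\tilde{F}_{\tilde{\alpha}}\bigr) \tag{109}$$

$$\mathrm{risk}_D(\bar{F}) = \mathrm{risk}_{\tilde{D}}\bigl(\tilde{\bar{F}}\bigr). \tag{110}$$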

The proofs involve doing straightforward algebraic manipulations to verify the identities 
and are omitted. 
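
Although the algebra is omitted, the empirical-risk identity is easy to check numerically. The sketch below rests on several assumptions: labels are coded as `0..k-1` rather than `1..k`, the binary risk is taken to be the standard exponential loss `exp(-label * score)` averaged over the transformed examples, and all function names are hypothetical. It builds the mislabel triples, the transformed scoring function $F(x, l) - F(x, y)$, and the transformed weak classifier $1[h(x) = l] - 1[h(x) = y]$.

```python
import math


def multiclass_risk(F, data, k):
    """(1 / (m (k - 1))) * sum_i sum_{l != y_i} exp(F(x_i, l) - F(x_i, y_i))."""
    m = len(data)
    total = sum(math.exp(F(x, l) - F(x, y))
                for x, y in data for l in range(k) if l != y)
    return total / (m * (k - 1))


def transform_dataset(data, k):
    """Mislabel triples (x, y, l) with l != y, each with binary label -1."""
    return [((x, y, l), -1) for x, y in data for l in range(k) if l != y]


def transform_scoring(F):
    """Transformed score on a triple: F(x, l) - F(x, y)."""
    return lambda t: F(t[0], t[2]) - F(t[0], t[1])


def transform_classifier(h):
    """Transformed weak classifier: 1[h(x) = l] - 1[h(x) = y], in {-1, 0, +1}."""
    return lambda t: int(h(t[0]) == t[2]) - int(h(t[0]) == t[1])


def binary_risk(F_tilde, binary_data):
    """Average exponential loss exp(-label * score) over the binary dataset."""
    total = sum(math.exp(-label * F_tilde(t)) for t, label in binary_data)
    return total / len(binary_data)


if __name__ == "__main__":
    k = 3
    data = [(0, 0), (1, 2), (2, 1)]          # toy (x, y) pairs
    F = lambda x, l: 0.3 * x - 0.5 * l + 0.1 * x * l
    lhs = multiclass_risk(F, data, k)
    rhs = binary_risk(transform_scoring(F), transform_dataset(data, k))
    print(lhs, rhs)
```

Each multiclass example yields $k - 1$ mislabel triples, so the two averages run over the same $m(k-1)$ terms and agree term by term.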

The next lemma connects the two algorithms. We show that the scoring function output 
by AdaBoost when run on the transformed dataset is the transformation of the function 
output by our algorithm. The proof again involves tedious but straightforward checking of 
details and is omitted. 

Lemma 34 If AdaBoost.MM produces scoring function $F_\alpha$ when run for $T$ rounds with the 
training set $S$ and weak classifier space $\mathcal{H}$, then AdaBoost produces the scoring function 
$\tilde{F}_{\tilde{\alpha}}$ when run for $T$ rounds with the training set $\tilde{S}$ and space $\tilde{\mathcal{H}}$. We assume that for both 
the algorithms, Weak Learner returns the weak classifier in each round that achieves the 
maximum edge. Further we consider the version of AdaBoost.MM that chooses weights 
according to the approximate rule (67). 

We next state the result for AdaBoost corresponding to Lemma 26, which appears in 
(Mukherjee et al., 2011). 

Lemma 35 [Theorem 8 in (Mukherjee et al., 2011)] Suppose AdaBoost produces the 
scoring function $\tilde{F}_{\tilde{\alpha}}$ when run for $T$ rounds with the training set $\tilde{S}$ and space $\tilde{\mathcal{H}}$. Then 

$$\widehat{\mathrm{risk}}\bigl(\tilde{F}_{\tilde{\alpha}}\bigr) \le \inf_{\tilde{\beta} : \tilde{\mathcal{H}} \to \mathbb{R}} \widehat{\mathrm{risk}}\bigl(\tilde{F}_{\tilde{\beta}}\bigr) + C/T, \tag{111}$$

where the constant $C$ depends only on the dataset. 

The previous lemma, along with (109), immediately proves Lemma 26. The result for 
AdaBoost corresponding to Lemma 27 appears in (Schapire and Freund, 2012). 

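Lemma 36 (Theorem 12.2 in (Schapire and Freund, 2012)) Suppose AdaBoost 
produces the scoring function $\tilde{F}$ when run for $T = \sqrt{m}$ rounds with the training set $\tilde{S}$ and space 
$\tilde{\mathcal{H}}$. Then 

$$\Pr\Bigl[\mathrm{risk}_{\tilde{D}}\bigl(\tilde{\bar{F}}\bigr) \le \inf_{\tilde{F}' : \tilde{X} \to \mathbb{R}} \mathrm{risk}_{\tilde{D}}(\tilde{F}') + O\bigl(m^{-c}\bigr)\Bigr] \ge 1 - \frac{1}{m^2}, \tag{112}$$

where the constant $c$ depends only on the dataset. 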

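The proof of Lemma 27 follows immediately from the above lemma and (110). 

| | AdaBoost.MM | AdaBoost |
|---|---|---|
| Labels | $Y = \{1, \ldots, k\}$ | $\tilde{Y} = \{-1, +1\}$ |
| Examples | $X$ | $\tilde{X} = X \times \bigl((Y \times Y) \setminus \{(y, y) : y \in Y\}\bigr)$ |
| Weak classifiers | $h : X \to Y$ | $\tilde{h} : \tilde{X} \to \{-1, 0, +1\}$, where $\tilde{h}(x, y, l) = 1[h(x) = l] - 1[h(x) = y]$ |
| Classifier space | $\mathcal{H}$ | $\tilde{\mathcal{H}}$ |
| Scoring function | $F : X \times Y \to \mathbb{R}$ | $\tilde{F} : \tilde{X} \to \mathbb{R}$, where $\tilde{F}(x, y, l) = F(x, l) - F(x, y)$ |
| Clamped function | $\bar{F}(x, l) = \max\{-C,\ F(x, l) - \max_{l'} F(x, l')\}$ | $\tilde{\bar{F}}(x, y, l) = \tilde{F}(x, y, l)$ if $\lvert \tilde{F}(x, y, l) \rvert \le C$; $\tilde{\bar{F}}(x, y, l) = C$ if $\lvert \tilde{F}(x, y, l) \rvert > C$ |
| Classifier weights | $\alpha : \mathcal{H} \to \mathbb{R}$ | $\tilde{\alpha} : \tilde{\mathcal{H}} \to \mathbb{R}$, where $\tilde{\alpha}(\tilde{h}) = \alpha(h)$ |
| Combined hypothesis | $F_\alpha$, where $F_\alpha(x, l) = \sum_{h \in \mathcal{H}} \alpha(h)\, 1[h(x) = l]$ | $\tilde{F}_{\tilde{\alpha}}$, where $\tilde{F}_{\tilde{\alpha}}(x, y, l) = \sum_{\tilde{h} \in \tilde{\mathcal{H}}} \tilde{\alpha}(\tilde{h})\, \tilde{h}(x, y, l)$ |
| Training set | $S = \{(x_i, y_i) : 1 \le i \le m\}$ | $\tilde{S} = \{((x_i, y_i, l), -1) : l \ne y_i,\ 1 \le i \le m\}$ |
| Test distribution | $D$ over $X \times Y$ | $\tilde{D}$ over $\tilde{X} \times \tilde{Y}$, where $\tilde{D}((x, y, l), -1) = D(x, y)/(k - 1)$ and $\tilde{D}((x, y, l), +1) = 0$ |
| Empirical risk | $\widehat{\mathrm{risk}}(F) = \frac{1}{m(k-1)} \sum_{i=1}^{m} \sum_{l \ne y_i} e^{F(x_i, l) - F(x_i, y_i)}$ | [expression not recoverable from the source] |
| Test risk | $\mathrm{risk}_D(F) =$ [expression not recoverable from the source] | $\mathrm{risk}_{\tilde{D}}(\tilde{F}) =$ [expression not recoverable from the source] |

Figure 11: Details of transformation between AdaBoost.MM and AdaBoost. 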